MCP Operations Guide Q4`18 documentation

MCP Operations Guide

Preface

This documentation provides information on how to use Mirantis products to deploy cloud environments. The information is for reference purposes and is subject to change.

Intended audience

This documentation is intended for deployment engineers, system administrators, and developers; it assumes that the reader is already familiar with network and cloud concepts.

Documentation history

The following table lists the released revisions of this documentation:

Revision date Description
February 8, 2019 Q4`18 GA

Introduction

This guide outlines the post-deployment Day-2 operations for an MCP cloud. It describes how to configure and manage the MCP components, perform different types of cloud verification, and enable additional features depending on your cloud needs. The guide also contains day-to-day maintenance procedures such as how to back up and restore, update and upgrade, or troubleshoot an MCP cluster.

Provision hardware

MCP uses Ubuntu's Metal-as-a-Service (MAAS) to provision hardware for an MCP deployment.

As a bare metal provisioning service, MAAS requires an IPMI user to manage the power state of nodes. Configure this user as part of the installation process.

MAAS provides DHCP to the network(s) on which compute nodes reside. Compute nodes then perform PXE boot from the network and you can configure MAAS to provision these PXE booted nodes.

Reinstall MAAS

If your MAAS instance is lost or broken, you can reinstall it. This section describes how to install MAAS from the Ubuntu Server 16.04 qcow2 image.

To reinstall MAAS:

  1. Create a cloud-config disk:

    1. For example, create a configuration file named config-drive.yaml:

      #cloud-config
      debug: True
      ssh_pwauth: True
      disable_root: false
      chpasswd:
        list: |
          root:r00tme
          ubuntu:r00tme
        expire: False
      

      Note

      You must change the default password.

    2. Create the configuration drive:

      export VM_CONFIG_DISK="/var/lib/libvirt/images/maas/maas-config.iso"
      cloud-localds  --hostname maas01 --dsmode local ${VM_CONFIG_DISK}  config-drive.yaml
      
  2. Create a VM system disk using the preloaded qcow2 image. For example:

    wget http://cloud-images.ubuntu.com/releases/xenial/release-20180306/ubuntu-16.04-server-cloudimg-amd64-disk1.img -O \
    /var/lib/libvirt/images/maas/maas-system-backend.qcow2
    
    export VM_SOURCE_DISK="/var/lib/libvirt/images/maas/maas-system.qcow2"
    qemu-img create -b /var/lib/libvirt/images/maas/maas-system-backend.qcow2 -f qcow2 ${VM_SOURCE_DISK} 100G
    
  3. Create a VM using the predefine-vm script. For example:

    export MCP_VERSION="master"
    wget https://github.com/Mirantis/mcp-common-scripts/blob/${MCP_VERSION}/predefine-vm/define-cfg01-vm.sh
    
    chmod 0755 define-cfg01-vm.sh
    export VM_NAME="maas01.<CLUSTER_DOMAIN>"
    

    Note

    You may add other optional variables that have default values and set them according to your deployment configuration. These variables include:

    • VM_MGM_BRIDGE_NAME="br-mgm"
    • VM_CTL_BRIDGE_NAME="br-ctl"
    • VM_MEM_KB="8388608"
    • VM_CPUS="4"

    The br-mgm and br-ctl values are the names of the Linux bridges. See MCP Deployment Guide: Prerequisites to deploying MCP DriveTrain for details. Custom names can be passed to a VM definition using the VM_MGM_BRIDGE_NAME and VM_CTL_BRIDGE_NAME variables accordingly.
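
    For example, after exporting the variables, run the downloaded script to define the VM. This is a minimal sketch that assumes the script reads the exported VM_* variables; verify the required variables against the script itself before running it:

    export VM_MGM_BRIDGE_NAME="br-mgm"
    export VM_CTL_BRIDGE_NAME="br-ctl"
    ./define-cfg01-vm.sh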

  4. Boot the created VM.

  5. Log in to the VM using the previously defined password.

  6. Proceed with installing MAAS:

    sudo apt-get install maas
    
  7. Configure MAAS as required to complete the installation.

  8. Verify the installation by opening the MAAS web UI:

    http://<MAAS-IP-ADDRESS>:5240/MAAS
    
  9. If you have installed MAAS from the packages, create an initial (administrative) user first to log in to the MAAS web UI:

    sudo maas createadmin --username=<PROFILE> --email=<EMAIL_ADDRESS>
    

Add an SSH key

To simplify access to provisioned nodes, add an SSH key that MAAS will deliver when provisioning these nodes.

To add an SSH key:

  1. In the MAAS web UI, open the user page by clicking your user name at the top-right corner.
  2. Find the SSH keys section.
  3. From the Source drop-down menu, specify the source of an SSH key using one of the options:
    • Launchpad, specifying your Launchpad ID
    • Github, specifying your Github ID
    • Upload, placing an existing public key to the Public key edit box
  4. Click Import.

Add a boot image

You can select images with appropriate CPU architectures that MAAS will import, regularly sync, and deploy to the managed nodes.

To add a new boot image:

  1. In the MAAS web UI, open the Images page.
  2. Select the releases you want to make available, as well as the required architectures.
  3. Click Save selection.

Add a subnet

You can add new networking elements in MAAS such as fabrics, VLANs, subnets, and spaces. MAAS should detect new network elements automatically. Otherwise, you can add them manually.

To add a new subnet:

  1. In the MAAS web UI, open the Subnets page.
  2. Select Subnet in the Add drop-down menu at the top-right corner.
  3. Specify Name, CIDR, Gateway IP, DNS servers, Fabric & VLAN, and Space.
  4. Click Add subnet.

See also

MAAS Networking

Enable DHCP on a VLAN

Before enabling DHCP, ensure that the network interface of the MAAS node is properly configured to listen on the VLAN.

You can use external DHCP or enable MAAS-managed DHCP. Using an external DHCP server for enlistment and commissioning may work but is not supported. High availability also depends upon MAAS-managed DHCP.

To enable MAAS-managed DHCP on a VLAN:

  1. In the MAAS web UI, open the Subnets page.
  2. Click on the VLAN you want to enable DHCP on.
  3. In the VLAN configuration panel, verify that DHCP is shown as Disabled.
  4. From the Take action drop-down menu at the top-right corner, select the Provide dhcp item.
  5. In the Provide DHCP panel, verify or change the settings for Rack controller, Subnet, Dynamic range start IP, Dynamic range end IP.
  6. Click Provide dhcp.
  7. In the VLAN configuration panel, verify that DHCP is enabled.

Enable device discovery

MAAS provides passive and active methods of device discovery.

Passive methods include:

  • Listening to ARP requests
  • Listening to DNS advertisements

To enable passive device discovery:

  1. In the MAAS web UI, open the MAAS dashboard.
  2. On the MAAS dashboard, turn on the Discovery enabled switch.
  3. Verify if you can see the discovered devices on the MAAS dashboard.

Active methods include active subnet mapping, which makes MAAS periodically scan all subnets that have active mapping enabled, using the configured active subnet mapping interval.

To enable active subnet mapping:

  1. In the MAAS web UI, open the Settings page.
  2. Go to the Device Discovery section.
  3. From the drop-down menu, select the value for Active subnet mapping interval.
  4. Open the Subnets page.
  5. Click the subnet you want to enable active mapping on.
  6. In the Subnet summary section, turn on the Active mapping switch.

See also

Device discovery

Add a new node

Using MAAS, you can add new nodes either in an unattended way, called enlistment, or manually, when enlistment does not work.

MAAS enlistment uses a combination of DHCP with TFTP and PXE technologies.

Note

  • To boot a node over PXE, enable netboot or network boot in BIOS.

  • For KVM virtual machines, specify the boot device as network in the VM configuration file and add the node manually. You need to configure the Virsh power type and provide access to the KVM host as described in BMC Power Types.

  • To ignore the already deployed machines and avoid issues with the wait_for_ready Salt state failure, you may need to set ignore_deployed_machines to true in the Reclass model:

    parameters:
      maas:
        region:
          ignore_deployed_machines: true
    

To add a new node manually:

  1. In the MAAS web UI, open the Nodes page.

  2. Click the Add hardware drop-down menu at the top-right corner and select Add machine.

  3. Set Machine name, Domain, Architecture, Minimum Kernel, Zone, MAC Address, Power type.

    Note

    See Configure power management for more details on power types.

  4. Click Save machine.

MAAS will add the new machine to the list of nodes with the status Commissioning.

Configure power management

MAAS supports many types of power control, from standard IPMI to non-standard types such as virsh, VMware, Nova, or even completely manual ones that require operator intervention. While most servers may use their own custom vendor management, for example, iLO or DRAC, standard IPMI controls are also supported, and you can use IPMI as shown in the following example.

To configure IPMI node power management type:

  1. In the MAAS web UI, open the Nodes page.
  2. In the list of nodes, select the one you want to configure.
  3. In the machine configuration window, go to the Power section and click the Edit button at the top-right corner.
  4. Select IPMI for Power type from the drop-down menu.
  5. Specify parameters for the IPMI power type: Power driver, IP address, Power user, Power password, and Power MAC.
  6. Click Save changes.

After saving the changes, MAAS will verify that it can manage the node through IPMI.

See also

BMC Power Types

Commission a new node

When you add a new node, MAAS automatically starts commissioning the node once configuration is done. Also, you can commission a node manually.

To commission a new node manually:

  1. In the MAAS web UI, open the Nodes page.
  2. In the list of nodes, click the node you want to configure.
  3. In the node configuration window, click Take action at the top-right corner and select Commission.
  4. Select additional options by setting the appropriate check boxes: allow SSH access and prevent the machine from powering off, or retain the network and storage configuration if you want to preserve data on the node, for example, if the node comes from an existing cloud with an instance on it.
  5. Click Go to start commissioning.
  6. Verify that the node status has changed to Ready and that the hardware summary in the node configuration window has been filled with values other than zeros, which means that commissioning was successful.

Note

Use the MAAS CLI to commission a group of machines, as sketched below.
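
A minimal sketch of commissioning a group of machines through the MAAS CLI is shown below. The profile name, the jq-based filter, and the assumption that the target machines are in the New state are illustrative; adjust them to your installation:

maas login <PROFILE> http://<MAAS-IP-ADDRESS>:5240/MAAS/ $(sudo maas apikey --username=<PROFILE>)
for SYSTEM_ID in $(maas <PROFILE> machines read | jq -r '.[] | select(.status_name=="New") | .system_id'); do
  maas <PROFILE> machine commission ${SYSTEM_ID}
done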

Deploy a node

Once a node has been commissioned, you can deploy it.

The deployment operation includes installing an operating system and copying the SSH keys imported to MAAS. As a result, you can access the deployed node through SSH using the default user account ubuntu.

To deploy a node:

  1. In the MAAS web UI, open the Nodes page.
  2. In the list of nodes, verify that the commissioned node is in the Ready state.
  3. Click on the node to open the configuration window.
  4. In the node configuration window, click the Take action button at the top-right corner and select the Deploy item.
  5. Specify the OS, release, and kernel options.
  6. Click Go to start the deployment.

Once the deployment is finished, MAAS will change the node status to Deployed.

Redeploy a node

To redeploy a node:

  1. In the MAAS web UI, open the Nodes page.
  2. In the list of nodes, select the node you want to redeploy.
  3. In the node configuration window, click Take action at the top-right corner and select Release.
  4. Verify that the node status has changed to Ready.
  5. Redeploy the node as described in Deploy a node.

Delete a node

Warning

Deleting a node in MAAS is a permanent operation. The node will be powered down and removed from the MAAS database. All existing configuration of the node, such as its name, hardware specifications, and power control type, will be permanently lost. The node can be added again later; however, MAAS will not recognize it. In that case, you will need to add it as a new node and reconfigure it from scratch.

To delete a node:

  1. In the MAAS web UI, open the Nodes page.
  2. In the list of nodes, select the node you want to delete.
  3. In the node configuration window, click the Take action button at the top-right corner and select Delete.
  4. Click Go to confirm the deletion.

SaltStack operations

SaltStack is an orchestration and configuration platform that implements the model-driven architecture (MDA). It turns the service configuration described in Reclass models and Salt Formulas into actual services running on nodes.

Salt Minion operations

Run a command on a node

The Salt integrated cmd.run function is highly flexible: it enables the operator to pass nearly any bash command to a node or a group of nodes and can serve as a simple batch processing tool.

To run a command on a node, execute:

salt '<NODE_NAME>' cmd.run '<COMMAND>'
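
For example, to check the disk usage on all compute nodes (the cmp* targeting pattern is an assumption about your node naming convention):

salt 'cmp*' cmd.run 'df -h'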

List services on a node

The Salt integrated service.get_all function shows available services on the node.

To list services on a node, run:

salt '<NODE_NAME>' service.get_all

Restart a service on a node

You can use the Salt integrated service.restart function to restart services.

To restart a service on a node, run:

salt '<NODE_NAME>' service.restart <SERVICE_NAME>
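
For example, to restart the ntp service on a single node (the node name and service name are illustrative):

salt 'ctl01*' service.restart ntp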

Note

If you do not know the name of the service or are unsure which services are available, see the List services on a node section.

Verify Minions have joined the Master

If you are not sure whether a Minion has joined the Master, verify the output of salt-key. The command lists all known keys grouped by the following states:

  • Accepted

    Nodes in this state have successfully joined the Salt Master

  • Denied

    Nodes in this state have not successfully joined because of a bad, duplicate, or rejected key. Nodes in this state require additional user action to join the Master.

  • Rejected

    Nodes in this state have been explicitly rejected by an administrator.

To verify Minions have joined the Master, run:

salt-key

Example of a system response:

Accepted Keys:
<NODE_NAME>.domain.local
... [snip] ...
Denied Keys:
Unaccepted Keys:
Rejected Keys:

Ping a Minion from the Master

You can ping all properly running Salt Minion nodes from the Salt Master. To verify that you have network availability between Salt Minion nodes and the Salt Master node, use the test.ping command.

To ping a Minion from the Master:

salt '<NODE_NAME>.domain.local' test.ping

Example of a system response:

<NODE_NAME>
 True

Salt States operations

Salt State is a declarative or imperative representation of a system state.

List available States of a Minion

A Salt Minion node can have different States.

To list available States of a Minion, execute on a node:

salt-call state.show_top

Example of a system response:

local:
 ----------
 base:
     - linux
     - ntp
     - salt
     - heka
     - openssh
     - nova
     - opencontrail
     - ceilometer

Apply a State to a Minion

You can apply changes to a Minion’s State from the Salt Master.

To apply a State to a Minion, run:

salt '<NODE_NAME>' state.sls <STATE_NAME>
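
For example, to apply the ntp state to a single node (the node name is illustrative; the state name is taken from the example list above):

salt 'ctl01*' state.sls ntp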

Salt Formula operations

Salt Formula is a declarative or imperative representation of a system configuration.

Verify and validate a Salt Formula

You can verify and validate a new Salt Formula before applying it by running a quick test that checks for invalid Jinja, invalid YAML, and errors in the Salt state.

To verify a SLS file in a Salt Formula, run:

salt '*' state.show_sls <SLS_FILE_NAME>

To validate the Salt Formula, run it in the test-only (dry-run) mode using the test=True option:

salt '*' state.apply test=True
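
For example, to preview the changes that the linux state would make on the controller nodes without applying them (the ctl* targeting pattern is illustrative):

salt 'ctl*' state.apply linux test=True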

Apply a Salt Formula

This section covers how you can test and apply a Salt Formula.

To apply all configured states (highstate) from a Salt Formula to all Minions, run on the Salt Master:

salt '*' state.apply

Note

This command is equal to:

salt '*' state.highstate

To apply individual SLS files in a Salt Formula, run:

salt '*' state.apply <SLS_FILE_1>,<SLS_FILE_2>
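
For example, to apply only the linux and openssh states on all nodes (both state names are taken from the example list above and are illustrative):

salt '*' state.apply linux,openssh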

Warning

Applying Salt Formulas on more than 100 nodes may result in numerous failures.

Note

SaltStack applies states on all targeted nodes in parallel, which can cause a temporary service outage that may affect end users. To avoid taking down services on all the nodes at the same time, you can stagger highstates in a batch mode.

To apply a Salt Formula on a large number of nodes, for example, more than 100, use one of the approaches below.

  • Use the --batch-size or -b flag to specify the number of nodes on which Salt applies a state in parallel:

    salt --batch-size <NUMBER_OF_NODES> '*' state.apply
    
  • Specify a percentage of nodes to apply a highstate on:

    salt -b <PERCENTAGE> '*' state.apply
    
  • Use node name conventions in the form of <GROUP>.<NODE_TYPE_NAME><NUM> to run a highstate by a pattern. For example: group1.cmp001:

    salt 'group1.cmp*' state.highstate
    
  • Use Node Groups that you can define in the Salt Master configuration file /etc/salt/master (see the sketch below). To run a highstate on the nodes within a Node Group, run:

    salt -N <NODE_GROUP> state.apply
    
  • Use Grains to group nodes: specify a grain variable in the /etc/salt/grains configuration file, then specify the grain value in the Salt command to apply a highstate on the nodes that have this grain value assigned (see the sketch below):

    salt -G <GRAIN_NAME>:<GRAIN_VALUE> state.apply
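
A minimal sketch that combines the last two approaches is shown below. The Node Group name computes, the grain name role, and its value compute are illustrative; adjust them to your naming conventions:

# Define a Node Group on the Salt Master; this assumes that the nodegroups
# section is not already present in /etc/salt/master
cat >> /etc/salt/master <<'EOF'
nodegroups:
  computes: 'cmp*'
EOF
service salt-master restart

# Assign a grain to the target minions and verify the targeting
salt 'cmp*' grains.setval role compute
salt -G 'role:compute' test.ping

# Apply a highstate per group of nodes
salt -N computes state.apply
salt -G 'role:compute' state.apply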
    

Note

You can use the --batch-size flag together with Node Groups and Grains. For example:

salt --batch-size 10% -N computes1 state.apply
salt -b 5 -G compute:compute1 state.apply

Replace the Salt Master keys

If your Salt Master keys have been compromised, you can replace both the Salt Master CA and RSA SSH keys. Replacing the Salt Master keys does not affect your cloud environment; only the Salt key structure is updated.

Salt Master keys structure:

  • /etc/salt/minion.d/_pki.conf - PKI configuration file pointing to the current Salt Master CA certificate path
  • /etc/pki/ca/salt_master_ca/ - Directory for the Salt Master CA certificate
  • /etc/pki/ca/salt_master_ca/ca.crt - Salt Master CA certificate
  • /etc/pki/ca/salt_master_ca/ca.key - Salt Master CA certificate key
  • /etc/pki/ca/salt_master_ca/certs - Directory for the Salt minion certificates signed by the Salt Master CA certificate
  • /etc/pki/ca/salt_master_ca/certs/XX:XX:XX:XX:XX:XX:XX:XX.crt - Salt minion certificate signed by the CA
  • /etc/salt/pki/minion/minion.pub, /etc/salt/pki/minion/minion.pem, /etc/salt/pki/minion/minion_master.pub - SSH RSA private and public keys for the Salt minion
  • /etc/salt/pki/master/master.pem, /etc/salt/pki/master/master.pub - SSH RSA private and public keys for the Salt Master
  • /etc/salt/pki/master/minions/ctl01.example.int - RSA SSH minion key used for communication with the Salt Master; identical to /etc/salt/pki/minion/minion.pub

Replace the Salt Master and Salt minions SSH RSA keys

This section provides instructions on how to replace the SSH RSA keys of the Salt Master and Salt minions.

To replace the Salt Master and Salt minion SSH RSA keys:

  1. Log in to the Salt Master node.

  2. Verify that all nodes are available:

    salt \* test.ping
    
  3. Create classes/cluster/<cluster-name>/infra/minions-maintenance.yml with the following content:

    parameters:
      _param:
        char_number_sign: "#"
      linux:
        system:
          file:
            restart-minion.sh:
              name: /usr/local/bin/restart-minion.sh
              user: root
              group: root
              mode: 750
              contents: |
                ${_param:char_number_sign}!/bin/bash
                /usr/sbin/service salt-minion stop
                rm -f /etc/salt/pki/minion/minion*;
                /usr/sbin/service salt-minion start
          job:
            restart-minion:
              enabled: True
              command: /usr/local/bin/restart-minion.sh
              user: root
              minute: '*/5'
    
  4. Include the minions-maintenance class in the infra/init.yml file:

    classes:
    ...
    - cluster.<cluster-name>.infra.minions-maintenance
    
  5. Put all Salt minions into the maintenance mode:

    salt \* state.sls linux.system.file,linux.system.job
    

    The command above causes all Salt minions to remove their keys and restart every 5 minutes.

  6. Count your minions:

    MINIONS_NUMBER=$(ls /etc/salt/pki/master/minions/ -1 | wc -l)
    
  7. Verify that all minions are in the maintenance mode by checking the diff between the /etc/salt/pki/master/minions/ and /etc/salt/pki/master/minions_denied/ directories:

    diff <(ls /etc/salt/pki/master/minions/ -1 | wc -l) \
    <(ls /etc/salt/pki/master/minions_denied/ -1 | wc -l)
    

    Start the verification at the beginning of the zero or fifth minute to have enough time to purge the old minion keys. Proceed only if the diff is empty. If the diff persists for more than 10 minutes, some minions have failed to execute the cron job. Identify the root cause of the issue and resolve it before proceeding.

  8. Stop the Salt Master node:

    service salt-master stop
    
  9. Change to the directory with the Salt Master keys:

    cd /etc/salt/pki/master
    
  10. Remove the Salt Master key:

    rm -f master.p*
    
  11. Generate a new key without a password:

    ssh-keygen -t rsa -b 4096 -f master.pem
    
  12. Remove the RSA public key for the new key as Salt Master does not require it:

    rm -f master.pem.pub
    
  13. Generate the .pem public key for the Salt Master node:

    openssl rsa -in master.pem -pubout -out master.pub
    

    Note

    If you are prompted for a passphrase at any of the steps above, press Enter to leave it empty.

  14. Remove the minions list on the Salt Master node:

    salt-key -y -d '*'
    
  15. Start the Salt Master node:

    service salt-master start
    
  16. Verify that the minions are present:

    Note

    The minions should register on the first or sixth minute.

    salt-key -L
    
  17. Verify that the current minion count is the same as in step 6:

    ls /etc/salt/pki/master/minions/ -1 | wc -l
    echo $MINIONS_NUMBER
    
  18. Disable the maintenance mode for minions by disabling the cron job in classes/cluster/<cluster-name>/infra/minions-maintenance.yml:

    job:
      restart-minion:
        enabled: False
    
  19. Update your minions:

    salt \* state.sls linux.system.job
    
  20. Remove the minions-maintenance class from the infra/init.yml file:

    classes:
    ...
    # Remove the following line
    - cluster.<cluster-name>.infra.minions-maintenance
    
  21. Remove the minions-maintenance pillar definition from the Reclass model:

    rm -f classes/cluster/<cluster-name>/infra/minions-maintenance.yml
    

Replace the Salt Master CA certificates

This section provides instructions on how to replace the Salt Master CA certificates.

To replace the Salt Master CA certificates:

  1. Log in to the Salt Master node.

  2. Back up the running Salt configuration in case the rollback is required:

    tar cf /root/salt-backup.tar /etc/salt /etc/pki/ca/salt_master_ca/
    gzip -9 /root/salt-backup.tar
    
  3. List all currently issued certificates.

    Currently, the index file for Salt Master CA does not exist. Therefore, you can list all certificates and find the latest ones using the salt_cert_list.py script:

    Note

    The script is available within Mirantis from the mcp-common-scripts GitHub repository.

    ./salt_cert_list.py
    

    Example of system response:

    /etc/pki/ca/salt_master_ca/certs/18:63:9E:A6:F3:7E:10:5F.crt (proxy, 10.20.30.10, horizon.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/EB:51:7C:DF:CE:E7:90:52.crt (10.20.30.10, 10.20.30.10, *.10.20.30.10)
    /etc/pki/ca/salt_master_ca/certs/15:DF:66:5C:8D:8B:CF:73.crt (internal_proxy, mdb01, mdb01.multinode-ha.int, 192.168.2.116, 192.168.2.115, 10.20.30.10)
    /etc/pki/ca/salt_master_ca/certs/04:30:B0:7E:76:98:5C:CC.crt (rabbitmq_server, msg01, msg01.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/26:16:E7:51:E4:44:B4:65.crt (mysql_server, 192.168.2.53, 192.168.2.50, dbs03, dbs03.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/78:26:2F:6E:2E:FD:6A:42.crt (internal_proxy, ctl02, 192.168.2.12, 10.20.30.10, 192.168.2.10)
    ...
    
  4. Update classes/cluster/<cluster_name>/infra/config.yml with the required values for the Salt Master CA. For example:

    parameters:
      _param:
        salt_minion_ca_country: us
        salt_minion_ca_locality: New York
        salt_minion_ca_organization: Planet Express
        salt_minion_ca_days_valid_authority: 3650
        salt_minion_ca_days_valid_certificate: 365
    
  5. Replace the Salt Master CA certificates:

    rm -f /etc/pki/ca/salt_master_ca/ca*
    salt-call state.sls salt.minion.ca -l debug
    
  6. Publish the Salt Master CA certificates as described in Publish CA certificates.

  7. Replace the certificates in your cloud environment, according to the list of certificates obtained in step 3 of this procedure, as described in Manage certificates for the affected services.

DriveTrain operations

This section describes the main capabilities of DriveTrain, the MCP lifecycle management engine.

Job configuration history

The DriveTrain Jenkins provides the capability to inspect the history of job configuration changes using the Job Configuration History plugin. This plugin captures and stores the changes to all jobs configured in Jenkins and enables the DriveTrain administrator to view these changes. It allows identifying the date of a change, the user who made it, and the content of the change itself.

To use the Job Configuration History plugin:

  1. Log in to the Jenkins web UI as an administrator using the FQDN of your cloud endpoint and port 8081. For example, https://cloud.example.com:8081.
  2. Navigate to Job Config History > Show job configs only.
  3. Click the required job and review the list of recorded changes.

Alternatively, you can access the history of a job by clicking Job Config History from the particular job view itself.

Abort a hung build in Jenkins

This section provides instructions on how to abort a hung Jenkins build, for example, if the build does not recover after the dedicated Jenkins node is restarted.

To abort a hung build, select from the following options:

  • Abort the job build from the Jenkins web UI:

    1. Log in to the Jenkins web UI as an Administrator using the FQDN of your cloud endpoint and the 8081 port. For example, https://cloud.example.com:8081.

    2. Navigate to Manage Jenkins > Script Console.

    3. Run the following script setting the job name and number of the hung build accordingly:

      def build = Jenkins.instance.getItemByFullName("jobName").getBuildByNumber(jobNumber)
      build.doStop()
      build.doKill()
      
  • Abort the job build from a cid node (if the previous option did not help):

    1. Log in to any cid node.

    2. Run:

      cd /srv/volumes/jenkins/jobs/<job-name>/builds/
      rm -rf <hung-build-number>
      
    3. Log in to the Jenkins web UI as an Administrator using the FQDN of your cloud endpoint and the 8081 port. For example, https://cloud.example.com:8081.

    4. Navigate to Manage Jenkins > Reload Configuration from Disk and confirm the reload. Alternatively, restart the Jenkins instance.

Enable Jenkins audit logging

This section instructs you on how to enable audit logging in Jenkins by enabling the Audit Trail Jenkins plugin. The plugin keeps a log of the users who performed particular Jenkins operations, such as managing and using jobs.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Note

If Jenkins is disabled on the Salt Master node, skip step 3 of the procedure below.

To set up audit logging in Jenkins:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file, configure any of the three logger types: console, file, and syslog.

    Note

    By default, only the console output is collected by Fluentd if enabled.

    Pillar examples:

    1. For the console logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                console_logger:
                  type: console
                  output: STD_OUT
                  date_format: "yyyy-MM-dd HH:mm:ss:SSS"
                  log_prefix: ""
      

      Note

      The date_format and log_prefix parameters in the example above are defaults and can be skipped.

    2. For the file logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                file_logger:
                  type: file
                  log: /var/jenkins_home/file_logger.log
                  limit: 100
                  count: 10
      

      Note

      The limit parameter sets the maximum log file size in MB. The count parameter sets the number of files to keep.

    3. For the syslog logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                syslog_logger:
                  type: syslog
                  syslog_server_hostname: 'syslog.host.org'
                  syslog_server_port: 514
                  syslog_facility: SYSLOG
                  app_name: jenkins
                  message_hostname: ""
                  message_format: RFC_3164
      
  4. To configure audit logging for Jenkins on the Salt Master node, add similar pillars to infra/config/jenkins.yml.

  5. Refresh pillars:

    salt -C 'I@jenkins:client' saltutil.refresh_pillar
    
  6. Apply the changes:

    salt -C 'I@jenkins:client:audittrail' state.apply jenkins.client.audittrail
    

Jenkins Matrix-based security authorization

The DriveTrain Jenkins uses Matrix-based security authorization by default. It allows you to grant specific permissions to users and groups. Jenkins uses DriveTrain OpenLDAP server as an identity provider and authentication server.

By default, the Jenkins server includes the following user groups:

  • admins

    Contains administrative users with the Jenkins Administer permission.

  • Authenticated Users

    Includes all users authenticated through the DriveTrain OpenLDAP server. This group has no permissions configured by default.

The Matrix-based security plugin enables the operator to configure the following types of permissions:

  • Overall

    Either Administer or Read permissions can be set overall.

  • Credentials

    Permissions to create, delete, update, and view authentication credentials, and manage domains.

  • Gerrit

    Permissions to manually trigger and retrigger the Gerrit integration plugin to run specific jobs that are normally initiated by the plugin.

  • Agents

    Permissions to manage Jenkins agents on worker nodes.

  • Job

    Permissions for specific operations on Jenkins jobs, including build creation, configuration, and execution.

  • Run

    Permissions to run and rerun jobs.

  • View

    Permissions to manage views in the Jenkins UI.

  • SCM

    Permissions to use SCM tags.

  • Metrics

    Permissions to view and configure metrics.

  • Lockable resources

    Permissions to reserve and unlock lockable resources manually.

  • Artifactory

    Permissions to use the Artifactory integration plugin (only if Artifactory is installed).

To configure the Matrix-based Security Authorization:

  1. Log in to the Jenkins web UI as an administrator using the FQDN of your cloud endpoint and port 8081. For example, https://cloud.example.com:8081.
  2. Navigate to Manage Jenkins > Configure Global Security.
  3. Scroll to the Authorization section to view and change the Matrix-based security settings.

Remove executors on Jenkins master

The DriveTrain Jenkins enabled on the Salt Master node allows you to run any job on any Jenkins slave or master node. However, running a job on the Jenkins master can lead to job failures because the Jenkins master is not intended to be a job executor.

Starting from the MCP 2019.2.4 maintenance update, MCP disables executors on Jenkins master by default to prevent failures of the jobs that run on the Salt Master node. For the MCP versions earlier than 2019.2.4, Mirantis recommends setting zero executors on Jenkins master to disable jobs scheduling on this agent node as described below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), you can skip the steps below or simply update your cluster configuration without applying the Salt states.

To set zero executors on Jenkins master on the Salt Master node:

  1. Log in to the Salt Master node.

  2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml, add the following pillar data:

    parameters:
      jenkins:
        client:
          ...
          node:
            master:
              num_executors: 0
          ...
    
  3. Refresh pillars:

    salt -C 'I@salt:master' saltutil.refresh_pillar
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.apply jenkins.client
    

Remove anonymous access to Jenkins on the Salt Master node

The DriveTrain Jenkins enabled on the Salt Master node is configured to allow anonymous users to access the Jenkins web UI, including listing the Jenkins jobs and builds.

For security reasons, starting from the MCP 2019.2.4 maintenance update, by default, only authorized users have access to Jenkins on the Salt Master node. For the MCP versions earlier than 2019.2.4, Mirantis recommends configuring Jenkins as described below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), you can skip the steps below or simply update your cluster configuration without applying the Salt states.

To remove anonymous access to Jenkins on the Salt Master node:

  1. Log in to the Salt Master node.

  2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml, replace anonymous with authenticated for jenkins_security_matrix_read:

    parameters:
      _param:
        jenkins_security_matrix_read:
        - authenticated
    
  3. Refresh pillars:

    salt -C 'I@salt:master' saltutil.refresh_pillar
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.apply jenkins.client
    

Use SSH Jenkins slaves

By default, Jenkins uses Java Network Launch Protocol (JNLP) for Jenkins slave connection. Starting from the MCP 2019.2.5 maintenance update, you can set up SSH connection for Jenkins slaves instead of JNLP using the steps below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), skip steps 2 and 3 of the procedure below.

To use SSH connection instead of JNLP for Jenkins slaves:

  1. Log in to the Salt Master node.

  2. Configure Jenkins Master for the Salt Master node to use SSH Jenkins slaves:

    1. Verify that the SSH keys for the Jenkins admin user exist:

      salt-call pillar.get _param:jenkins_admin_public_key_generated
      salt-call pillar.get _param:jenkins_admin_private_key_generated
      

      The system output must not be empty.

      If you do not have SSH keys, generate new ones:

      ssh-keygen
      
    2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml:

      1. Replace the system.docker.swarm.stack.jenkins.slave_single or system.docker.swarm.stack.jenkins.jnlp_slave_single class (whichever is present in the model) with the following class:

        classes:
        ...
        - system.docker.swarm.stack.jenkins.ssh_slave_single
        
      2. Remove the following classes if present:

        classes:
        ...
        - system.docker.client.images.jenkins_master
        - system.docker.client.images.jenkins_slave
        
      3. Change the Jenkins slave type to ssh instead of jnlp:

        parameters:
          ...
          jenkins:
            client:
              node:
                slave01:
                  launcher:
                    type: ssh
        
      4. Add the SSH keys parameters to the parameters section:

        • If you use existing SSH keys:

          parameters:
            _param:
              ...
              jenkins_admin_public_key: ${_param:jenkins_admin_public_key_generated}
              jenkins_admin_private_key: ${_param:jenkins_admin_private_key_generated}
              ...
          
        • If you generated new SSH keys in step 2.1:

          parameters:
            _param:
              ...
              jenkins_admin_public_key: <ssh-public-key>
              jenkins_admin_private_key: <ssh-private-key>
              ...
          
  3. Remove the JNLP slave from Jenkins on the Salt Master node:

    1. Log in to the Jenkins web UI on the Salt Master node.
    2. Navigate to Manage Jenkins > Manage nodes.
    3. Select slave01 > Delete agent. Click yes to confirm.

  4. Configure Jenkins Master for the cid nodes to use SSH Jenkins slaves:

    1. Verify that the Jenkins SSH key is defined in the Reclass model:

      salt 'cid01*' pillar.get _param:jenkins_admin_public_key
      salt 'cid01*' pillar.get _param:jenkins_admin_private_key
      
    2. In ./classes/cluster/<cluster_name>/cicd/control/leader.yml:

      1. Replace the system.docker.swarm.stack.jenkins class with system.docker.swarm.stack.jenkins.master and the system.docker.swarm.stack.jenkins.jnlp_slave_multi class with system.docker.swarm.stack.jenkins.ssh_slave_multi if present, or add system.docker.swarm.stack.jenkins.ssh_slave_multi explicitly.

      2. Add the system.jenkins.client.ssh_node class right below the system.jenkins.client.node class:

        classes:
        ...
        - system.jenkins.client.node
        - system.jenkins.client.ssh_node
        
  5. Remove the JNLP slaves from Jenkins on the cid nodes:

    1. Log in to the Jenkins web UI on the cid nodes.
    2. Navigate to Manage Jenkins > Manage nodes.
    3. Delete slave01, slave02 and slave03 using the menu. For example: slave01 > Delete agent. Click yes to confirm.

  6. Refresh pillars:

    salt -C 'I@jenkins:client' saltutil.refresh_pillar
    salt -C 'I@docker:client' saltutil.refresh_pillar
    
  7. Pull the ssh-slave Docker image:

    salt -C 'I@docker:client:images' state.apply docker.client.images
    
  8. Apply the changes:

    salt -C 'I@jenkins:client and I@docker:client' state.apply docker.client
    salt -C 'I@jenkins:client' state.apply jenkins.client
    

Enable HTTPS access from Jenkins to Gerrit

By default, Jenkins uses the SSH connection to access Gerrit repositories. This section explains how to set up the HTTPS connection from Jenkins to Gerrit repositories.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To enable access from Jenkins to Gerrit through HTTPS:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file:

    1. Replace the system.jenkins.client.credential.gerrit class with the system.jenkins.client.credential.gerrit_http class:

      classes:
      ...
      - system.jenkins.client.credential.gerrit_http
      
    2. Redefine the jenkins_gerrit_url parameter as follows:

      jenkins_gerrit_url: "https://${_param:haproxy_gerrit_bind_host}:${_param:haproxy_gerrit_bind_port}"
      
  4. Refresh pillars:

    salt -C 'I@jenkins:client and not I@salt:master' saltutil.refresh_pillar
    
  5. Apply the changes:

    salt -C 'I@jenkins:client and not I@salt:master' state.apply jenkins.client
    

Manage secrets in the Reclass model

MCP uses GPG encryption to protect sensitive data in the Git repositories of the Reclass model. The private key used to decrypt the data is stored on the Salt Master node and is available to the root user only. Usually, the data stored in the secrets.yml files located in the /srv/salt/reclass/cluster directory is encrypted. The decryption key is located in a keyring in /etc/salt/gpgkeys.

Note

MCP uses the secrets file name for organizing sensitive data management. If required, you can encrypt data in other files, as well as use unencrypted data in the secrets.yml files.

The secrets encryption feature is not enabled by default. To enable the feature, define secrets_encryption_enabled: 'True' in the Cookiecutter context before the deployment. See MCP Deployment Guide: Infrastructure related parameters: Salt Master for the details.

To change a password:

  1. Get the ID of the private key in question:

    # GNUPGHOME=/etc/salt/gpgkeys gpg --list-secret-keys
    

    The machine-readable version of the above command:

    # GNUPGHOME=/etc/salt/gpgkeys gpg --list-secret-keys --with-colons | awk -F: -e '/^sec/{print $5}'
    
  2. Encrypt the new password:

    # echo -ne <new_password> | GNUPGHOME=/etc/salt/gpgkeys gpg --encrypt --always-trust -a -r <key_id>
    
  3. Add the new password to secrets.yml.

To decrypt the data:

To get the decoded value, pass the encrypted value to the command:

# GNUPGHOME=/etc/salt/gpgkeys gpg --decrypt
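
For example, assuming that the encrypted value has been copied from a secrets.yml file into a local file named cipher.txt (the file name is arbitrary):

# GNUPGHOME=/etc/salt/gpgkeys gpg --decrypt < cipher.txt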

To change the secret encryption private key:

  1. Add a new key to keyring in /etc/salt/gpgkeys using one of the following options:

    • Import the existing key:

      # GNUPGHOME=/etc/salt/gpgkeys gpg --import < <key_file>
      
    • Create a new key:

      # GNUPGHOME=/etc/salt/gpgkeys gpg --gen-key
      
  2. Re-encrypt all encrypted fields in all secrets.yml files with the new key_id and replace the old values.

Configure allowed and rejected IP addresses for the GlusterFS volumes

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section provides instructions on how to configure the list of allowed and rejected IP addresses for the GlusterFS volumes.

By default, MCP restricts access to all preconfigured GlusterFS volumes to the control network.

To configure the GlusterFS authentication:

  1. Log in to the Salt Master node.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. In the infra/glusterfs.yml file, configure the GlusterFS authentication depending on the needs of your MCP deployment:

    • To adjust the list of allowed and rejected IP addresses on all preconfigured GlusterFS volumes, define the glusterfs_allow_ips and glusterfs_reject_ips parameters as required:

      parameters:
        _param:
          glusterfs_allow_ips: <comma-separated list of IPs>
          glusterfs_reject_ips: <comma-separated list of IPs>
      

      Note

      You can use the * wildcard to specify IP ranges.

      Configuration example:

      parameters:
        _param:
          glusterfs_allow_ips: 10.0.0.1, 192.168.1.*
          glusterfs_reject_ips: 192.168.1.201
      

      The configuration above allows the access to all GlusterFS volumes from 10.0.0.1 and all IP addresses in the 192.168.1.0/24 network except for 192.168.1.201.

    • To change allowed and rejected IP addresses for a single volume:

      parameters:
        glusterfs:
          server:
            volumes:
              <volume_name>:
                options:
                  auth.allow: <comma-separated_list_of_IPs>
                  auth.reject: <comma-separated_list_of_IPs>
      
    • To apply the same access control lists (ACLs) as used for all preconfigured GlusterFS volumes to a custom GlusterFS volume, define the auth.allow and auth.reject options for the target volume as follows:

      auth.allow: ${_param:glusterfs_allow_ips}
      auth.reject: ${_param:glusterfs_reject_ips}
      
  4. Apply the changes:

    salt -I 'glusterfs:server:role:primary' state.apply glusterfs
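
    Optionally, verify the resulting volume options on a GlusterFS server node. This is a minimal check: the gluster volume info output lists the configured auth.allow and auth.reject options:

    salt -I 'glusterfs:server:role:primary' cmd.run 'gluster volume info'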
    

Manage users in OpenLDAP

DriveTrain uses OpenLDAP to provide authentication and metadata for MCP users. This section describes how to create a new user entry in the OpenLDAP service through the Reclass cluster metadata model and grant the user permissions to access Gerrit and Jenkins.

To add a user to an OpenLDAP server:

  1. Log in to the Salt Master node.

  2. Check out the latest version of the Reclass cluster metadata model from the Git repository for your project.

  3. Create a new directory called people in classes/cluster/<CLUSTER_NAME>/cicd/:

    mkdir classes/cluster/<cluster_name>/cicd/people
    

    New user definitions will be added to this directory.

  4. Create a new YAML file in the people directory for a new user. For example, joey.yml:

    touch classes/cluster/<cluster_name>/cicd/people/joey.yml
    
  5. In the newly created file, add the user definition. For example:

    parameters:
      _param:
        openldap_pw_joey: "<ENCRYPTED_PASSWORD>"
      openldap:
        client:
          entry:
            people:
              entry:
                joey:
                  attr:
                    uid: joey
                    userPassword: ${_param:openldap_pw_joey}
                    uidNumber: 20600
                    gidNumber: 20001
                    gecos: "Joey Tribbiani"
                    givenName: Joey
                    sn: Tribbiani
                    homeDirectory: /home/joey
                    loginShell: /bin/bash
                    mail: joey@domain.tld
                  classes:
                    - posixAccount
                    - inetOrgPerson
                    - top
                    - shadowAccount
    

    Parameters description:

    • openldap_pw_joey

      The user password for the joey user that can be created using the following example command:

      echo "{CRYPT}$(mkpasswd --rounds 500000 -m sha-512 \
        --salt `head -c 40 /dev/random | base64 | sed -e 's/+/./g' \
        |  cut -b 10-25` 'r00tme')"
      

      Substitute r00tme with the actual user password; the command outputs the encrypted value to use.

    • uid

      The case-sensitive user ID to be used as a login ID for Gerrit, Jenkins, and other integrated services.

    • userPassword: ${_param:openldap_pw_joey}

      The password for the joey user, same as the openldap_pw_joey value.

    • gidNumber

      An integer uniquely identifying a group in an administrative domain, which a user should belong to.

    • uidNumber

      An integer uniquely identifying a user in an administrative domain.

  6. Add the new user definition from joey.yml as a class in classes/cluster/<CLUSTER_NAME>/cicd/control/leader.yml:

    classes:
      ...
      - cluster.<CLUSTER_NAME>.cicd.control
      - cluster.<CLUSTER_NAME>.cicd.people.joey
    

    By defining the cluster level parameters of the joey user and including them in the classes section of cluster/<CLUSTER_NAME>/cicd/control/leader.yml, you import the user data to the cid01 node inventory, although the parameters have not been rendered yet.

  7. Commit the change.

  8. Update the copy of the model on the Salt Master node:

    sudo git -C /srv/salt/reclass pull
    
  9. Synchronize all Salt resources:

    sudo salt '*' saltutil.sync_all
    
  10. Apply the changes:

    sudo salt 'cid01*' state.apply openldap
    

    Example output for a successfully created user:

            ID: openldap_client_cn=joey,ou=people,dc=deploy-name,dc=local
      Function: ldap.managed
        Result: True
       Comment: Successfully updated LDAP entries
       Started: 18:12:29.788665
      Duration: 58.193 ms
       Changes:
                ----------
                cn=joey,ou=people,dc=deploy-name,dc=local:
                    ----------
                    new:
                        ----------
                        cn:
                            - joey
                        gecos:
                            - Joey Tribbiani
                        gidNumber:
                            - 20001
                        givenName:
                            - Joey
                        homeDirectory:
                            - /home/joey
                        loginShell:
                            - /bin/bash
                        mail:
                            - joey@domain.tld
                        objectClass:
                            - inetOrgPerson
                            - posixAccount
                            - shadowAccount
                            - top
                        sn:
                            - Tribbiani
                        uid:
                            - joey
                        uidNumber:
                            - 20600
                        userPassword:
                            - {CRYPT}$6$rounds=500000$KaJBYb3F8hYMv.UEHvc0...
                    old:
                        None
    
    Summary for cid01.domain.tld
    ------------
    Succeeded: 7 (changed=1)
    Failed:    0
    ------------
    Total states run:     7
    Total run time: 523.672 ms
    

Disable LDAP authentication on host OS

This section describes how to disable LDAP authentication on a host operating system.

To disable LDAP authentication:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster_name>/infra/auth/ldap.yml, disable LDAP:

    ldap:
      enabled: false
    
  3. Enforce the linux.system update:

    salt '<target_node>*' state.sls linux.system
    
  4. Clean up nodes:

    salt '<target_node>*' cmd.run 'export DEBIAN_FRONTEND=noninteractive; apt purge -y libnss-ldapd libpam-ldapd; sed -i "s/ ldap//g" /etc/nsswitch.conf'
    

Enable Gerrit audit logging

This section instructs you on how to enable audit logging in Gerrit by configuring the httpd request logger. Fluentd collects the files with the error logs automatically.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To set up audit logging in Gerrit:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file, add the following parameters:

    parameters:
      _param:
        ...
        gerrit_extra_opts: "-Dlog4j.configuration=file:///var/gerrit/review_site/etc/log4j.properties"
        gerrit_http_request_log: 'True'
        ...
      linux:
        system:
          file:
            "/srv/volumes/gerrit/etc/log4j.properties":
              contents:
                - log4j.logger.httpd_log=INFO,httpd_log
                - log4j.appender.httpd_log=org.apache.log4j.ConsoleAppender
                - log4j.appender.httpd_log.layout=com.google.gerrit.pgm.http.jetty.HttpLogLayout
    
  4. Refresh pillars:

    salt -C 'I@gerrit:client' saltutil.refresh_pillar
    
  5. Create the log4j.properties file:

    salt -C 'I@gerrit:client' state.apply linux.system.file
    
  6. Update the Gerrit service:

    salt -C 'I@gerrit:client' state.apply docker.client
    

Configure log rotation using logrotate

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure log rotation for selected services using the logrotate utility.

The services that support the log rotation configuration include:

  • OpenStack services

    Aodh, Barbican, Ceilometer, Cinder, Designate, Glance, Gnocchi, Heat, Keystone, Neutron, Nova, Octavia, Ironic

  • Other services

    atop, Backupninja, Ceph, Elasticsearch, Galera (MySQL), GlusterFS, HAProxy, libvirt, MAAS, MongoDB, NGINX, Open vSwitch, PostgreSQL, RabbitMQ, Redis, Salt, Telegraf

MCP supports configuration of the rotation interval and the number of rotations. Configuration of other logrotate options, such as postrotate and prerotate actions, is not supported.

To configure log rotation:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. Configure the interval and rotate parameters for the target service as required:

    • logrotate:interval

      Define the rotation time interval. Available values are daily, weekly, monthly, and yearly.

    • logrotate:rotate

      Define the number of rotated logs to keep. The parameter expects an integer value.

    Use the Logrotate configuration table below to determine where to add the log rotation configuration.

    Logrotate configuration
    Service Pillar path (target) File path
    Aodh aodh:server openstack/telemetry.yml
    atop linux:system:atop The root file of the component [7]
    Backupninja backupninja:client infra/backup/client_common.yml
    Barbican barbican:server openstack/barbican.yml
    Ceilometer server [0] ceilometer:server openstack/telemetry.yml
    Ceilometer agent [0] ceilometer:agent openstack/compute/init.yml
    Ceph ceph:common ceph/common.yml
    Cinder controller [1] cinder:controller openstack/control.yml
    Cinder volume [1] cinder:volume openstack/control.yml
    Designate designate:server openstack/control.yml
    Elasticsearch server elasticsearch:server stacklight/log.yml
    Elasticsearch client elasticsearch:client stacklight/log.yml
    Galera (MySQL) master galera:master openstack/database/master.yml
    Galera (MySQL) slave galera:slave openstack/database/slave.yml
    Glance glance:server openstack/control.yml
    GlusterFS server [2] glusterfs:server The root file of the component [7]
    GlusterFS client [2] glusterfs:client The root file of the component [7]
    Gnocchi server gnocchi:server openstack/telemetry.yml
    Gnocchi client gnocchi:client openstack/control/init.yml
    HAProxy haproxy:proxy openstack/proxy.yml
    Heat heat:server openstack/control.yml
    Ironic (available since 2019.2.13) ironic:api openstack/baremetal.yml
    Keystone server keystone:server openstack/control.yml
    Keystone client keystone:client openstack/control/init.yml
    libvirt nova:compute:libvirt [3] openstack/compute/init.yml
    MAAS maas:region infra/maas.yml
    MongoDB mongodb:server stacklight/server.yml
    Neutron server neutron:server openstack/control.yml
    Neutron client neutron:client openstack/control/init.yml
    Neutron gateway neutron:gateway openstack/gateway.yml
    Neutron compute neutron:compute openstack/compute/init.yml
    NGINX nginx:server openstack/proxy.yml, stacklight/proxy.yml
    Nova controller nova:controller openstack/control.yml
    Nova compute nova:compute openstack/compute/init.yml
    Octavia manager [4] octavia:manager openstack/octavia_manager.yml
    Octavia client [4] octavia:client openstack/control.yml
    Open vSwitch linux:network:openvswitch infra/init.yml
    PostgreSQL server postgresql:server (maas:region) [5] infra/config/postgresql.yml (infra/maas.yml)
    PostgreSQL client postgresql:client (maas:region) [5] infra/config/postgresql.yml (infra/maas.yml)
    RabbitMQ rabbitmq:server openstack/message_queue.yml
    Redis redis:server openstack/telemetry.yml
    Salt master [6] salt:master infra/config/init.yml
    Salt minion [6] salt:minion The root file of the component [7]
    Telegraf telegraf:agent infra/init.yml, stacklight/server.yml
    [0]

    If Ceilometer server and agent are specified on the same node, the server configuration is prioritized.

    [1]

    If Cinder controller and volume are specified on the same node, the controller configuration is prioritized.

    [2]

    If GlusterFS server and client are specified on the same node, the server configuration is prioritized.

    [3]

    Use nova:compute:libvirt as pillar path, but only nova:compute as target.

    [4]

    If Octavia manager and client are specified on the same node, the manager configuration is prioritized.

    [5]

    PostgreSQL is a dependency of MAAS. Configure PostgreSQL from the MAAS pillar only if the service has been installed as a dependency without the postgresql pillar defined. If the postgresql pillar is defined, configure it instead.

    [6]

    If the Salt Master and minion are specified on the same node, the master configuration is prioritized.

    [7]

    Depending on the nodes where you want to change the configuration, select their components’ root file. For example, infra/init.yml, openstack/control/init.yml, cicd/init.yml, and so on.

    For example, to set log rotation for Aodh to keep logs for the last 4 weeks with the daily rotation interval, add the following configuration to cluster/<cluster_name>/openstack/telemetry.yml:

    parameters:
      aodh:
        server:
          logrotate:
            interval: daily
            rotate: 28
    
  4. Apply the logrotate state on the node with the target service:

    salt -C 'I@<target>' saltutil.sync_all
    salt -C 'I@<target>' state.sls logrotate
    

    For example:

    salt -C 'I@aodh:server' state.sls logrotate
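
To verify the result, you can, for example, list the rendered logrotate configuration files on the target node. The exact file names under /etc/logrotate.d/ depend on the formulas installed on that node:

salt -C 'I@aodh:server' cmd.run 'ls /etc/logrotate.d/'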
    

Configure remote logging for auditd

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure remote logging for auditd.

To configure remote logging for auditd:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<cluster_name>/ directory, open one of the following files:

    • To configure one remote host for auditd for all nodes, use infra/init.yml.
    • To configure a remote host for a set of nodes, use a specific configuration file. For example, openstack/compute/init.yml for all OpenStack compute nodes.
  3. Configure the remote host using the following exemplary pillar:

    parameters:
      audisp:
        enabled: true
        remote:
          remote_server: <ip_address or hostname>
          port: <port>
          local_port: any
          transport: tcp
          ...
          key1: value1
    
  4. Refresh pillars on the target nodes:

    salt <nodes> saltutil.refresh_pillar
    
  5. Apply the auditd.audisp state on the target nodes:

    salt <nodes> state.apply auditd.audisp
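
For illustration, assuming a hypothetical log collector at 192.168.0.10 listening on TCP port 601 and all OpenStack compute nodes as targets, the sequence might look as follows. The configuration file path in the last command is the default one for the audisp-remote plugin and may differ in your deployment:

salt 'cmp*' saltutil.refresh_pillar
salt 'cmp*' state.apply auditd.audisp
# Optionally inspect the rendered remote logging settings
salt 'cmp*' cmd.run 'grep -E "remote_server|port" /etc/audisp/audisp-remote.conf'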
    

Configure memory limits for the Redis server

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure the memory rules and limits for the Redis server.

To configure memory limits for the Redis server:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/openstack/telemetry.yml, specify the following parameters as required:

    parameters:
      redis:
        server:
          maxmemory: 1073741824 # 1GB
          maxmemory-policy: <memory-policy>
          maxmemory-samples: 3
    

    Supported values for the maxmemory-policy parameter include:

    • volatile-lru - the service removes the key with expiration time set using the Least Recently Used (LRU) algorithm
    • allkeys-lru - the service removes any key according to the LRU algorithm
    • volatile-random - the service removes a random key with an expiration time set
    • allkeys-random - the service removes any random key
    • volatile-ttl - the service removes the key with the nearest expiration time (minor TTL)
    • noeviction - the service does not remove any key but returns an error on write operations
  3. Apply the changes:

    salt -C 'I@redis:server' saltutil.refresh_pillar
    salt -C 'I@redis:server' state.apply redis.server
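
To verify the applied limits, you can, for example, inspect the rendered Redis configuration on the telemetry nodes. The /etc/redis/redis.conf path is the default one for the Ubuntu redis-server package and may differ in your deployment:

salt -C 'I@redis:server' cmd.run 'grep -E "^maxmemory" /etc/redis/redis.conf'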
    

Configure multiple NTP servers

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

MCP enables you to configure multiple Network Time Protocol (NTP) servers on new or existing MCP clusters to provide more flexible and extensive NTP support for clustered applications such as Ceph, Galera, and others.

For new MCP clusters, configure multiple NTP servers during the deployment model creation using the ntp_servers parameter passed to Cookiecutter in the following format:

server1.ntp.org,server2.ntp.org,server3.ntp.org

For details, see Networking deployment parameters in MCP Deployment Guide: Create a deployment metadata model.

For existing MCP clusters, configure multiple NTP servers by updating the NTP configuration for MAAS and all nodes of an MCP cluster.

To configure multiple NTP servers for MAAS:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In infra/maas.yml, update the MAAS pillars using the example below:

    parameters:
      maas:
        region:
          ...
          ntp:
            server1:
              enabled: True
              host: ntp.example1.org
            server2:
              enabled: True
              host: ntp.example2.org
    
  4. Update the MAAS configuration:

    salt-call saltutil.refresh_pillar
    salt-call state.apply maas.region
    

To configure multiple NTP servers for all nodes of an MCP cluster:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In infra/init.yml, update the NTP pillars using the example below:

    parameters:
      ntp:
        client:
          enabled: true
          stratum:
            primary:
              server: primary.ntp.org
            secondary: # if exists
              server: secondary.ntp.org
            srv_3: # if exists
              server: srv_3.ntp.org
    
  4. Update the NTP configuration:

    salt '*' saltutil.refresh_pillar
    salt '*' state.apply ntp.client
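
To verify that the nodes synchronize with the configured servers, you can, for example, query the NTP peers on the target nodes, assuming the classic ntpd client is used:

salt '*' cmd.run 'ntpq -p'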
    

Configure kernel crash dumping

Note

This feature is available starting from the MCP 2019.2.16 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

The Kernel Crash Dump (kdump) mechanism allows you to save the contents of the system memory at the time of a kernel crash for later analysis. This section instructs you on how to configure kernel crash dumping for all or particular nodes. The dump will be saved in the /var/crash directory. For more details, see Kernel Crash Dump.

To configure kernel crash dumping:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In <cluster_name>/infra/init.yml, add the following variable:

    parameters:
      linux:
        system:
          kernel_crash_dump:
            enabled: true
    
  4. Apply the changes for the required nodes or for all nodes. To apply the changes for all nodes:

    salt -C '*' state.sls linux.system.kernel_crash_dump
    
  5. Manually reboot the nodes for which you have enabled kernel crash dumping. For example, to reboot the OpenStack compute nodes:

    salt -C 'cmp1*' system.reboot --async
    

Once done, you can view the dump in the /var/crash directory.
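
For example, assuming the standard Ubuntu kdump-tools package is used, you can verify the kdump status and list the collected dumps on a rebooted node:

salt 'cmp1*' cmd.run 'kdump-config show'
salt 'cmp1*' cmd.run 'ls -l /var/crash'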

OpenStack operations

This section includes all OpenStack-related Day-2 operations such as reprovisioning of OpenStack controller and compute nodes, preparing the Ironic service to provision cloud workloads on bare metal nodes, and others.

Manage Virtualized Control Plane

This section describes operations with the MCP Virtualized Control Plane (VCP).

Add a controller node

If you need to expand the VCP to handle a bigger data plane, you can add more controller nodes to your cloud environment. This section instructs you on how to add a KVM node and an OpenStack controller VM to an existing environment.

The same procedure can be applied to scale the messaging, database, and any other VCP services.

In this case, add the required additional parameters before the deployment.

To add a controller node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Log in to the Salt Master node.

  3. In the /classes/cluster/<cluster_name>/infra/init.yml file, define the basic parameters for the new KVM node:

    parameters:
      _param:
        infra_kvm_node04_address: <IP ADDRESS ON CONTROL NETWORK>
        infra_kvm_node04_deploy_address: <IP ADDRESS ON DEPLOY NETWORK>
        infra_kvm_node04_storage_address: ${_param:infra_kvm_node04_address}
        infra_kvm_node04_public_address: ${_param:infra_kvm_node04_address}
        infra_kvm_node04_hostname: kvm<NUM>
        glusterfs_node04_address: ${_param:infra_kvm_node04_address}
    linux:
      network:
        host:
          kvm04:
            address: ${_param:infra_kvm_node04_address}
            names:
            - ${_param:infra_kvm_node04_hostname}
            - ${_param:infra_kvm_node04_hostname}.${_param:cluster_domain}
    
  4. In the /classes/cluster/<cluster_name>/openstack/init.yml file, define the basic parameters for the new OpenStack controller node.

    openstack_control_node<NUM>_address: <IP_ADDRESS_ON_CONTROL_NETWORK>
    openstack_control_node<NUM>_hostname: <HOSTNAME>
    openstack_database_node<NUM>_address: <DB_IP_ADDRESS>
    openstack_database_node<NUM>_hostname: <DB_HOSTNAME>
    openstack_message_queue_node<NUM>_address: <IP_ADDRESS_OF_MESSAGE_QUEUE>
    openstack_message_queue_node<NUM>_hostname: <HOSTNAME_OF_MESSAGE_QUEUE>
    

    Example of configuration:

    kvm04_control_ip: 10.167.4.244
    kvm04_deploy_ip: 10.167.5.244
    kvm04_name: kvm04
    openstack_control_node04_address: 10.167.4.14
    openstack_control_node04_hostname: ctl04
    
  5. In the /classes/cluster/<cluster_name>/infra/config.yml file, define the configuration parameters for the KVM and OpenStack controller nodes. For example:

    reclass:
      storage:
        node:
          infra_kvm_node04:
            name: ${_param:infra_kvm_node04_hostname}
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.infra.kvm
            params:
              keepalived_vip_priority: 103
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: ${_param:infra_kvm_node04_address}
              deploy_address: ${_param:infra_kvm_node04_deploy_address}
              public_address: ${_param:infra_kvm_node04_public_address}
              storage_address: ${_param:infra_kvm_node04_storage_address}
          openstack_control_node04:
            name: ${_param:openstack_control_node04_hostname}
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.control
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: ${_param:openstack_control_node04_address}
              keepalived_vip_priority: 104
              opencontrail_database_id: 4
              rabbitmq_cluster_role: slave
    
  6. In the /classes/cluster/<cluster_name>/infra/kvm.yml file, define a new brick for GlusterFS on all KVM nodes and extend salt:control, which later spawns the OpenStack controller VM. For example:

    _param:
      cluster_node04_address: ${_param:infra_kvm_node04_address}
    glusterfs:
      server:
        volumes:
          glance:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/glance
          keystone-keys:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/keystone-keys
          keystone-credential-keys:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/keystone-credential-keys
    salt:
      control:
        cluster:
          internal:
            domain: ${_param:cluster_domain}
            engine: virt
            node:
              ctl04:
                name: ${_param:openstack_control_node04_hostname}
                provider: ${_param:infra_kvm_node04_hostname}.${_param:cluster_domain}
                image: ${_param:salt_control_xenial_image}
                size: openstack.control
    
  7. In the /classes/cluster/<cluster_name>/openstack/control.yml file, add the OpenStack controller node into existing services such as HAProxy, and others, depending on your environment configuration.

    Example of adding an HAProxy host for Glance:

    _param:
      cluster_node04_hostname: ${_param:openstack_control_node04_hostname}
      cluster_node04_address: ${_param:openstack_control_node04_address}
    haproxy:
      proxy:
        listen:
          glance_api:
            servers:
            - name: ${_param:cluster_node04_hostname}
              host: ${_param:cluster_node04_address}
              port: 9292
              params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3
          glance_registry_api:
            servers:
            - name: ${_param:cluster_node04_hostname}
              host: ${_param:cluster_node04_address}
              port: 9191
              params: check
    
  8. Refresh the deployed pillar data by applying the reclass.storage state:

    salt '*cfg*' state.sls reclass.storage
    
  9. Verify that the target node has connectivity with the Salt Master node:

    salt '*kvm<NUM>*' test.ping
    
  10. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  11. On the Salt Master node, apply the Salt linux state for the added node:

    salt -C 'I@salt:control' state.sls linux
    
  12. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  13. Perform the initial Salt configuration:

    salt -C 'I@salt:control' state.sls salt.minion
    
  14. Set up the network interfaces and the SSH access:

    salt -C 'I@salt:control' state.sls linux.system.user,openssh,linux.network,ntp
    
  15. Reboot the KVM node:

    salt '*kvm<NUM>*' cmd.run 'reboot'
    
  16. On the Salt Master node, apply the libvirt state:

    salt -C 'I@salt:control' state.sls libvirt
    
  17. On the Salt Master node, create a controller VM for the added physical node:

    salt -C 'I@salt:control' state.sls salt.control
    

    Note

    Salt virt takes the name of a virtual machine and registers the virtual machine on the Salt Master node.

    Once created, the instance picks up an IP address from the MAAS DHCP service and the key will be seen as accepted on the Salt Master node.

  18. Verify that the controller VM has connectivity with the Salt Master node:

    salt 'ctl<NUM>*' test.ping
    
  19. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  20. Apply the Salt highstate for the controller VM:

    salt -C 'I@salt:control' state.highstate
    
  21. Verify that the added controller node is registered on the Salt Master node:

    salt-key
    
  22. To reconfigure the VCP VMs, run the openstack-deploy Jenkins pipeline with all the necessary installation parameters as described in MCP Deployment Guide: Deploy an OpenStack environment.
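
Before running the pipeline, you can, for example, confirm that the controller VM has been created and is running on the new KVM node:

salt '*kvm<NUM>*' cmd.run 'virsh list --all'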

Replace a KVM node

If a KVM node hosting the Virtualized Control Plane has failed and recovery is not possible, you can recreate the KVM node from scratch with all VCP VMs that were hosted on the old KVM node. The replaced KVM node will be assigned the same IP addresses as the failed KVM node.

Replace a failed KVM node

This section describes how to recreate a failed KVM node with all VCP VMs that were hosted on the old KVM node. The replaced KVM node will be assigned the same IP addresses as the failed KVM node.

To replace a failed KVM node:

  1. Log in to the Salt Master node.

  2. Copy and keep the hostname and GlusterFS UUID of the old KVM node.

    To obtain the UUIDs of all peers in the cluster:

    salt '*kvm<NUM>*' cmd.run "gluster peer status"
    

    Note

    Run the command above from a different KVM node of the same cluster since the command outputs other peers only.

  3. Verify that the KVM node is not registered in salt-key. If the node is present, remove it:

    salt-key | grep kvm<NUM>
    salt-key -d kvm<NUM>.domain_name
    
  4. Remove the salt-key records for all VMs originally running on the failed KVM node:

    salt-key -d <kvm_node_name><NUM>.domain_name
    

    Note

    You can list all VMs running on the KVM node using the salt '*kvm<NUM>*' cmd.run 'virsh list --all' command. Alternatively, obtain the list of VMs from cluster/infra/kvm.yml.

  5. Add or reprovision a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  6. Verify that the new node has been registered on the Salt Master node successfully:

    salt-key | grep kvm
    

    Note

    If the new node is not available in the list, wait some time until the node becomes available or use the IPMI console to troubleshoot the node.

  7. Verify that the target node has connectivity with the Salt Master node:

    salt '*kvm<NUM>*' test.ping
    
  8. Verify that salt-common and salt-minion have the same version for the new node as the rest of the cluster.

    salt -t 10 'kvm*' cmd.run 'dpkg -l |grep "salt-minion\|salt-common"'
    

    Note

    If the command above shows a different version for the new node, follow the steps described in Install the correct versions of salt-common and salt-minion.

  9. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.refresh_pillar
    
  10. Apply the linux state for the added node:

    salt '*kvm<NUM>*' state.sls linux
    
  11. Perform the initial Salt configuration:

    1. Run the following commands:

      salt '*kvm<NUM>*' cmd.run "touch /run/is_rebooted"
      salt '*kvm<NUM>*' cmd.run 'reboot'
      

      Wait some time until the node is rebooted.

    2. Verify that the node is rebooted:

      salt '*kvm<NUM>*' cmd.run 'if [ -f "/run/is_rebooted" ];then echo \
      "Has not been rebooted!";else echo "Rebooted";fi'
      

      Note

      The node must be in the Rebooted state.

  12. Set up the network interfaces and the SSH access:

    salt -C 'I@salt:control' state.sls linux.system.user,openssh,linux.network,ntp
    
  13. Apply the libvirt state for the added node:

    salt '*kvm<NUM>*' state.sls libvirt
    
  14. Recreate the original VCP VMs on the new node:

    salt '*kvm<NUM>*' state.sls salt.control
    

    Note

    Salt virt takes the name of a VM and registers it on the Salt Master node.

    Once created, the instance picks up an IP address from the MAAS DHCP service and the key will be seen as accepted on the Salt Master node.

  15. Verify that the added VCP VMs are registered on the Salt Master node:

    salt-key
    
  16. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  17. Apply the highstate for the VCP VMs:

    salt '*kvm<NUM>*' state.highstate
    
  18. Verify that the new node has the correct IP addresses, for example, as shown below, and proceed to restore the GlusterFS configuration as described in Recover GlusterFS on a replaced KVM node.
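
For example, you can compare the addresses configured on the new node with the ones defined in the cluster model. The _param keys below follow the model described in Add a controller node and may differ in your deployment:

salt '*kvm<NUM>*' cmd.run 'ip -4 addr show'
salt '*kvm<NUM>*' pillar.item _param:single_address _param:deploy_address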

Recover GlusterFS on a replaced KVM node

After you replace a KVM node as described in Replace a failed KVM node, if your new KVM node has the same IP address, proceed with recovering GlusterFS as described below.

To recover GlusterFS on a replaced KVM node:

  1. Log in to the Salt Master node.

  2. Define the IP addresses of the failed KVM node and of any working KVM node that runs the GlusterFS cluster services. For example:

    FAILED_NODE_IP=<IP_of_failed_kvm_node>
    WORKING_NODE_IP=<IP_of_working_kvm_node>
    
  3. If the failed node has been recovered with the old disk and GlusterFS installed:

    1. Remove the /var/lib/glusterd directory:

      salt -S $FAILED_NODE_IP file.remove '/var/lib/glusterd'
      
    2. Restart glusterfs-server:

      salt -S $FAILED_NODE_IP service.restart glusterfs-server
      
  4. Configure glusterfs-server on the failed node:

    salt -S $FAILED_NODE_IP state.apply glusterfs.server.service
    
  5. Remove the failed node from the GlusterFS cluster:

    salt -S $WORKING_NODE_IP cmd.run "gluster peer detach $FAILED_NODE_IP"
    
  6. Re-add the failed node to the GlusterFS cluster with a new ID:

    salt -S $WORKING_NODE_IP cmd.run "gluster peer probe $FAILED_NODE_IP"
    
  7. Finalize the configuration of the failed node:

    salt -S $FAILED_NODE_IP state.apply
    
  8. Set the correct trusted.glusterfs.volume-id attribute in the GlusterFS directories on the failed node:

    for vol in $(salt --out=txt -S $WORKING_NODE_IP cmd.run 'for dir in /srv/glusterfs/*; \
    do echo -n "${dir}@0x"; getfattr  -n trusted.glusterfs.volume-id \
    --only-values --absolute-names $dir | xxd -g0 -p;done' | awk -F: '{print $2}'); \
    do VOL_PATH=$(echo $vol| cut -d@ -f1); TRUST_ID=$(echo $vol | cut -d@ -f2); \
    salt -S $FAILED_NODE_IP cmd.run "setfattr -n trusted.glusterfs.volume-id -v $TRUST_ID $VOL_PATH"; \
    done
    
  9. Restart glusterfs-server:

    salt -S $FAILED_NODE_IP service.restart glusterfs-server
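
After the restart, you can, for example, verify that the recovered node has rejoined the cluster and that the volume bricks are online:

salt -S $WORKING_NODE_IP cmd.run "gluster peer status"
salt -S $WORKING_NODE_IP cmd.run "gluster volume status"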
    

Move a VCP node to another host

To move the VCP VMs that run the cloud environment services safely, take a single VM at a time: stop it, move its disk to another host, and start the VM again on the new host machine. The services running on the VM should remain available during the whole process due to the high availability ensured by Keepalived and HAProxy.

To move a VCP node to another host:

  1. To synchronize your deployment model with the new setup, update the /classes/cluster/<cluster_name>/infra/kvm.yml file:

    salt:
      control:
        cluster:
          internal:
            node:
              <nodename>:
                name: <nodename>
                provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                # replace 'infra_kvm_node03_hostname' param with the new kvm nodename provider
    
  2. Apply the salt.control state on the new KVM node:

    salt-call state.sls salt.control
    
  3. Destroy the newly spawned VM on the new KVM node:

    virsh list
    virsh destroy <nodename><nodenum>.<domainname>
    
  4. Log in to the KVM node originally hosting the VM.

  5. Stop the VM:

    virsh list
    virsh destroy <nodename><nodenum>.<domainname>
    
  6. Move the disk to the new KVM node using, for example, the scp utility, replacing the empty disk spawned by the salt.control state with the correct one:

    scp /var/lib/libvirt/images/<nodename><nodenum>.<domainname>/system.qcow2 \
    <diff_kvm_nodename>:/var/lib/libvirt/images/<nodename><nodenum>.<domainname>/system.qcow2
    
  7. Start the VM on the new KVM host:

    virsh start <nodename><nodenum>.<domainname>
    
  8. Verify that the services on the moved VM work correctly, for example, as shown after this procedure.

  9. Log in to the KVM node that was hosting the VM originally and undefine it:

    virsh list --all
    virsh undefine <nodename><nodenum>.<domainname>
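
As a check for step 8, you can, for example, verify from the Salt Master node that the moved VM responds and reports no failed services:

salt '<nodename><nodenum>*' test.ping
salt '<nodename><nodenum>*' cmd.run 'systemctl --failed'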
    

Enable host passthrough for VCP

Note

This feature is available starting from the MCP 2019.2.16 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable the host-passthrough CPU mode that can enhance performance of the MCP Virtualized Control Plane (VCP). For details, see libvirt documentation: CPU model and topology.

Warning

Prior to enabling the host passthrough, run the following command to verify that it is applicable to your deployment:

salt -C "I@salt:control" cmd.run "virsh list | tail -n +3 | awk '{print \$1}' | xargs -I{} virsh dumpxml {} | grep cpu_mode"

If the output is empty, proceed to enabling host passthrough. Otherwise, first contact Mirantis support.

To enable host passthrough:

  1. Log in to a KVM node.

  2. Obtain the list of running VMs:

    virsh list
    

    Example of system response:

    Id  Name                                State
    ------------------------------------------------
     1  msg01.bm-cicd-queens-ovs-maas.local running
     2  rgw01.bm-cicd-queens-ovs-maas.local running
     3  dbs01.bm-cicd-queens-ovs-maas.local running
     4  bmt01.bm-cicd-queens-ovs-maas.local running
     5  kmn01.bm-cicd-queens-ovs-maas.local running
     6  cid01.bm-cicd-queens-ovs-maas.local running
     7  cmn01.bm-cicd-queens-ovs-maas.local running
     8  ctl01.bm-cicd-queens-ovs-maas.local running
    
  3. Edit the configuration of each VM using the virsh edit %VM_NAME% command. Add the following lines to the XML configuration file:

    <cpu mode='host-passthrough'>
      <cache mode='passthrough'/>
    </cpu>
    

    For example:

    <domain type='kvm'>
      <name>msg01.bm-cicd-queens-ovs-maas.local</name>
      <uuid>81e18795-cf2f-4ffc-ac90-9fa0a3596ffb</uuid>
      <memory unit='KiB'>67108864</memory>
      <currentMemory unit='KiB'>67108864</currentMemory>
      <vcpu placement='static'>16</vcpu>
      <cpu mode='host-passthrough'>
        <cache mode='passthrough'/>
      </cpu>
      <os>
        <type arch='x86_64' machine='pc-i440fx-bionic'>hvm</type>
        <boot dev='hd'/>
      </os>
    .......
    
  4. Perform steps 1-3 on the remaining KVM nodes one by one.

  5. Log in to the Salt Master node.

  6. Reboot the VCP nodes as described in Scheduled maintenance with a planned power outage using the salt 'nodename01*' system.reboot command. Do not reboot the kvm, apt, and cmp nodes.

    Warning

    Reboot nodes one by one instead of rebooting all nodes of the same role at a time. Wait for 10 minutes between each reboot.
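
After the reboot, you can, for example, confirm on each KVM node that the running VM definitions now include the host-passthrough CPU mode:

virsh list --name | while read vm; do
  [ -n "$vm" ] && echo "$vm" && virsh dumpxml "$vm" | grep -A1 "cpu mode"
done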

Manage compute nodes

This section provides instructions on how to manage the compute nodes in your cloud environment.

Add a compute node

This section describes how to add a new compute node to an existing OpenStack environment.

To add a compute node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Verify that the compute node is defined in /classes/cluster/<cluster_name>/infra/config.yml.

    Note

    Within this file, create as many host definitions as there are compute nodes in your environment.

    Note

    Verify that the count parameter is increased by the number of compute nodes being added.

    Configuration example if the dynamic compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_rack01:
            name: ${_param:openstack_compute_rack01_hostname}<<count>>
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            repeat:
              count: 20
              start: 1
              digits: 3
              params:
                single_address:
                  value: 172.16.47.<<count>>
                  start: 101
                tenant_address:
                  value: 172.16.47.<<count>>
                  start: 101
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
    

    Configuration example if the static compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_node01:
            name: cmp01
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: 10.0.0.101
              deploy_address: 10.0.1.101
              tenant_address: 10.0.2.101
    
  3. Define the cmp<NUM> control address and hostname in the <cluster>/openstack/init.yml file:

    _param:
      openstack_compute_node<NUM>_address: <control_network_IP>
      openstack_compute_node<NUM>_hostname: cmp<NUM>
    
    linux:
      network:
        host:
          cmp<NUM>:
            address: ${_param:openstack_compute_node<NUM>_address}
            names:
            - ${_param:openstack_compute_node<NUM>_hostname}
            - ${_param:openstack_compute_node<NUM>_hostname}.${_param:cluster_domain}
    
  4. Apply the reclass.storage state on the Salt Master node to generate node definitions:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*cmp<NUM>*' test.ping
    
  6. Apply the following states:

    salt 'cfg*' state.sls salt.minion.ca
    salt '*cmp<NUM>*' state.sls salt.minion.cert
    
  7. Deploy a new compute node as described in MCP Deployment Guide: Deploy physical servers.

    Caution

    Do not use compound targets for this step since they affect the already running physical servers and reboot them. Use the Salt minion IDs instead of compound targets when running the pipelines or deploying physical servers manually.

    Incorrect:

    salt -C 'I@salt:control or I@nova:compute or I@neutron:gateway' \
     cmd.run "touch /run/is_rebooted"
    salt --async -C 'I@nova:compute' cmd.run 'salt-call state.sls \
     linux.system.user,openssh,linux.network;reboot'
    

    Correct:

    salt cmp<NUM> cmd.run "touch /run/is_rebooted"
    salt --async cmp<NUM> cmd.run 'salt-call state.sls \
     linux.system.user,openssh,linux.network;reboot'
    

    Note

    We recommend that you rerun the Jenkins Deploy - OpenStack pipeline that runs on the Salt Master node with the same parameters as you have set initially during your environment deployment. This guarantees that your compute node will be properly set up and added.
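
Once the node is deployed, you can, for example, verify from an OpenStack controller node, with the RC file sourced, that the new compute service and hypervisor are registered:

openstack compute service list --service nova-compute
openstack hypervisor list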

Reprovision a compute node

Provisioning of compute nodes is relatively straightforward as you can run all states at once. However, you need to apply the states and reboot the node multiple times for the network configuration changes to take effect.

Note

Multiple reboots are needed because the ordering of dependencies is not yet orchestrated.

To reprovision a compute node:

  1. Verify that the name of the cmp node is not registered in salt-key on the Salt Master node:

    salt-key | grep 'cmp*'
    

    If the node is shown in the above command output, remove it:

    salt-key -d cmp<NUM>.domain_name
    
  2. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  3. Verify that the required nodes are defined in /classes/cluster/<cluster_name>/infra/config.yml.

    Note

    Within this file, create as many host definitions as there are compute nodes in your environment.

    Configuration example if the dynamic compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_rack01:
            name: ${_param:openstack_compute_rack01_hostname}<<count>>
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            repeat:
              count: 20
              start: 1
              digits: 3
              params:
                single_address:
                  value: 172.16.47.<<count>>
                  start: 101
                tenant_address:
                  value: 172.16.47.<<count>>
                  start: 101
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
    

    Configuration example if the static compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_node01:
            name: cmp01
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: 10.0.0.101
              deploy_address: 10.0.1.101
              tenant_address: 10.0.2.101
    
  4. Apply the reclass.storage state on the Salt Master node to generate node definitions:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*cmp<NUM>*' test.ping
    
  6. Verify that the Salt Minion nodes are synchronized:

    salt '*cmp<NUM>*' saltutil.sync_all
    
  7. Apply the Salt highstate on the compute node(s):

    salt '*cmp<NUM>*' state.highstate
    

    Note

    Failures may occur during the first run of highstate. Rerun the state until it is successfully applied.

  8. Reboot the compute node(s) to apply network configuration changes.

  9. Reapply the Salt highstate on the node(s):

    salt '*cmp<NUM>*' state.highstate
    
  10. Provision the vRouter on the compute node using CLI or the Contrail web UI. Example of the CLI command:

    salt '*cmp<NUM>*' cmd.run '/usr/share/contrail-utils/provision_vrouter.py \
        --host_name <CMP_HOSTNAME> --host_ip <CMP_IP_ADDRESS> --api_server_ip <CONTRAIL_VIP> \
        --oper add --admin_user admin --admin_password <PASSWORD> \
        --admin_tenant_name admin --openstack_ip <OPENSTACK_VIP>'
    

    Note

    • To obtain <CONTRAIL_VIP>, run salt-call pillar.get _param:keepalived_vip_address on any ntw node.
    • To obtain <OPENSTACK_VIP>, run salt-call pillar.get _param:keepalived_vip_address on any ctl node.
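
For example, assuming the default ntw01 and ctl01 node naming, you can obtain both VIP addresses directly from the Salt Master node:

salt 'ntw01*' pillar.get _param:keepalived_vip_address
salt 'ctl01*' pillar.get _param:keepalived_vip_address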

Remove a compute node

This section instructs you on how to safely remove a compute node from your OpenStack environment.

To remove a compute node:

  1. Stop and disable the salt-minion service on the compute node you want to remove:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  2. Verify that the name of the node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep cmp<NUM>
    salt-key -d cmp<NUM>.domain_name
    
  3. Log in to an OpenStack controller node.

  4. Source the OpenStack RC file to set the required environment variables for the OpenStack command-line clients:

    source keystonercv3
    
  5. Disable the nova-compute service on the target compute node:

    openstack compute service set --disable <cmp_host_name> nova-compute
    
  6. Verify that Nova does not schedule new instances on the target compute node by viewing the output of the following command:

    openstack compute service list
    

    The command output should display the disabled status for the nova-compute service running on the target compute node.

  7. Migrate your instances using the openstack server migrate command. You can perform live or cold migration.

  8. Log in to the target compute node.

  9. Stop the nova-compute service:

    systemctl disable nova-compute
    systemctl stop nova-compute
    
  10. Log in to the OpenStack controller node.

  11. Obtain the ID of the compute service to delete:

    openstack compute service list
    
  12. Delete the compute service substituting service_id with the value obtained in the previous step:

    openstack compute service delete <service_id>
    
  13. Select from the following options:

    • For the deployments with OpenContrail:

      1. Log in to the target compute node.

      2. Stop the supervisor-vrouter service:

        systemctl disable supervisor-vrouter
        systemctl stop supervisor-vrouter
        
      3. Log in to the OpenContrail UI.

      4. Navigate to Configure > Infrastructure > Virtual Routers.

      5. Select the target compute node.

      6. Click Delete.

    • For the deployments with OVS:

      1. Stop the neutron-openvswitch-agent service:

        systemctl disable neutron-openvswitch-agent.service
        systemctl stop neutron-openvswitch-agent.service
        
      2. Obtain the ID of the target compute node agent:

        openstack network agent list
        
      3. Delete the network agent substituting cmp_agent_id with the value obtained in the previous step:

        openstack network agent delete <cmp_agent_id>
        
  14. If you plan to replace the removed compute node with a new compute node with the same hostname, you need to manually clean up the resource provider record from the placement service using the curl tool:

    1. Log in to an OpenStack controller node.

    2. Obtain the token ID from the openstack token issue command output. For example:

      openstack token issue
      +------------+-------------------------------------+
      | Field      | Value                               |
      +------------+-------------------------------------+
      | expires    | 2018-06-22T10:30:17+0000            |
      | id         | gAAAAABbLMGpVq2Gjwtc5Qqmp...        |
      | project_id | 6395787cdff649cdbb67da7e692cc592    |
      | user_id    | 2288ac845d5a4e478ffdc7153e389310    |
      +------------+-------------------------------------+
      
    3. Obtain the resource provider UUID of the target compute node:

      curl -i -X GET <placement-endpoint-address>/resource_providers?name=<target-compute-host-name> -H \
      'content-type: application/json' -H 'X-Auth-Token: <token>'
      

      Substitute the following parameters as required:

      • placement-endpoint-address

        The placement endpoint can be obtained from the openstack catalog list command output. A placement endpoint includes the scheme, endpoint address, and port, for example, http://10.11.0.10:8778. Depending on the deployment, you may need to specify the https scheme rather than http.

      • target-compute-host-name

        The hostname of the compute node you are removing. For the correct hostname format to pass, see the Hypervisor Hostname column in the openstack hypervisor list command output.

      • token

        The token ID value obtained in the previous step.

      Example of system response:

      {
        "resource_providers": [
          {
            "generation": 1,
            "uuid": "08090377-965f-4ad8-9a1b-87f8e8153896",
            "links": [
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896",
                "rel": "self"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/aggregates",
                "rel": "aggregates"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/inventories",
                "rel": "inventories"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/usages",
                "rel": "usages"
              }
            ],
            "name": "<compute-host-name>"
          }
        ]
      }
      
    4. Delete the resource provider record from the placement service substituting placement-endpoint-address, target-compute-node-uuid, and token with the values obtained in the previous steps:

      curl -i -X DELETE <placement-endpoint-address>/resource_providers/<target-compute-node-uuid> -H \
      'content-type: application/json' -H 'X-Auth-Token: <token>'
      
  15. Log in to the Salt Master node.

  16. Remove the compute node definition from the model in infra/config.yml under the reclass:storage:node pillar.

  17. Remove the generated file for the removed compute node under /srv/salt/reclass/nodes/_generated.

  18. Remove the compute node from StackLight LMA:

    1. Update and clear the Salt mine:

      salt -C 'I@salt:minion' state.sls salt.minion.grains
      salt -C 'I@salt:minion' saltutil.refresh_modules
      salt -C 'I@salt:minion' mine.update clear=true
      
    2. Refresh the targets and alerts:

      salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus -b 1
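
To confirm that the node has been fully removed, you can, for example, verify that it no longer appears in the OpenStack service and agent lists and is not registered in salt-key; the commands should return empty output:

openstack compute service list | grep <cmp_host_name>
openstack network agent list | grep <cmp_host_name>
salt-key | grep cmp<NUM>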
      

Reboot a compute node

This section instructs you on how to reboot an OpenStack compute node for a planned maintenance.

To reboot an OpenStack compute node:

  1. Log in to an OpenStack controller node.

  2. Disable scheduling of new VMs to the node. Optionally provide a reason comment:

    openstack compute service set --disable --disable-reason \
    maintenance <compute_node_hostname> nova-compute
    
  3. Migrate workloads from the OpenStack compute node:

    nova host-evacuate-live <compute_node_hostname>
    
  4. Log in to an OpenStack compute node.

  5. Stop the nova-compute service:

    service nova-compute stop
    
  6. Shut down the OpenStack compute node, perform the maintenance, and turn the node back on.

  7. Verify that the nova-compute service is up and running:

    service nova-compute status
    
  8. Perform the following steps from the OpenStack controller node:

    1. Enable scheduling of VMs to the node:

      openstack compute service set --enable <compute_node_hostname> nova-compute
      
    2. Verify that the nova-compute service and neutron agents are running on the node:

      openstack network agent list --host <compute_node_hostname>
      openstack compute service list --host <compute_node_hostname>
      

      The OpenStack compute service State must be up. The Neutron agent State must be UP, and the Alive column must include :-).

      Examples of a positive system response:

      +----+--------------+------+------+---------+-------+----------------------------+
      | ID | Binary       | Host | Zone | Status  | State | Updated At                 |
      +----+--------------+------+------+---------+-------+----------------------------+
      | 70 | nova-compute | cmp1 | nova | enabled | up    | 2020-09-17T08:51:07.000000 |
      +----+--------------+------+------+---------+-------+----------------------------+
      
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      | ID       | Agent Type         | Host | Availability Zone | Alive | State | Binary                    |
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      | e4256d73 | Open vSwitch agent | cmp1 | None              | :-)   | UP    | neutron-openvswitch-agent |
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      
    3. Optional. Migrate the instances back to their original OpenStack compute node.

Manage gateway nodes

This section describes how to manage tenant network gateway nodes that provide access to an external network for the environments configured with Neutron OVS as a networking solution.

Add a gateway node

The gateway nodes are hardware nodes that provide gateways and routers to the OVS-based tenant networks using network virtualization functions. A standard cloud configuration includes three gateway nodes. However, you can scale the networking throughput by adding more gateway servers.

This section explains how to increase the number of the gateway nodes in your cloud environment.

To add a gateway node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Define the gateway node in /classes/cluster/<cluster_name>/infra/config.yml. For example:

    parameters:
      _param:
        openstack_gateway_node03_hostname: gtw03
        openstack_gateway_node03_tenant_address: <IP_of_gtw_node_tenant_address>
      reclass:
        storage:
          node:
            openstack_gateway_node03:
              name: ${_param:openstack_gateway_node03_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.openstack.gateway
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename: ${_param:linux_system_codename}
                single_address: ${_param:openstack_gateway_node03_address}
                tenant_address: ${_param:openstack_gateway_node03_tenant_address}
    
  3. On the Salt Master node, generate node definitions by applying the reclass.storage state:

    salt '*cfg*' state.sls reclass.storage
    
  4. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*gtw<NUM>*' test.ping
    
  5. Verify that the Salt Minion nodes are synchronized:

    salt '*gtw<NUM>*' saltutil.sync_all
    
  6. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  7. Perform the initial Salt configuration:

    salt '*gtw<NUM>*' state.sls salt.minion
    
  8. Set up the network interfaces and the SSH access:

    salt '*gtw<NUM>*' state.sls linux.system.user,openssh,linux.network,ntp,neutron
    
  9. Apply the highstate on the gateway node:

    salt '*gtw<NUM>*' state.highstate
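
After the highstate is applied, you can, for example, verify from an OpenStack controller node that the Neutron agents on the new gateway node are registered and alive. Substitute gtw<NUM> with the gateway hostname as it is displayed in the agent list:

openstack network agent list --host gtw<NUM>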
    

Reprovision a gateway node

If a tenant network gateway node is down, you may need to reprovision it.

To reprovision a gateway node:

  1. Verify that the name of the gateway node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep gtw<NUM>
    salt-key -d gtw<NUM>.domain_name
    
  2. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  3. Verify that the required gateway node is defined in /classes/cluster/<cluster_name>/infra/config.yml.

  4. Generate the node definition by applying the reclass.storage state on the Salt Master node:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target node has connectivity with the Salt Master node:

    salt '*gtw<NUM>*' test.ping
    
  6. Verify that the Salt Minion nodes are synchronized:

    salt '*gtw<NUM>*' saltutil.sync_all
    
  7. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  8. Perform the initial Salt configuration:

    salt '*gtw<NUM>*' state.sls salt.minion
    
  9. Set up the network interfaces and the SSH access:

    salt '*gtw<NUM>*' state.sls linux.system.user,openssh,linux.network,ntp,neutron
    
  10. Apply the Salt highstate on the gateway node:

    salt '*gtw<NUM>*' state.highstate
    

Manage RabbitMQ

A RabbitMQ cluster is sensitive to external factors like network throughput and traffic spikes. When running under high load, it requires special start, stop, and restart procedures.

Restart a RabbitMQ node

Caution

We recommend that you do not restart a RabbitMQ node in a production environment by executing systemctl restart rabbitmq-server since the cluster can become inoperative.

To restart a single RabbitMQ node:

  1. Gracefully stop rabbitmq-server on the target node:

    systemctl stop rabbitmq-server
    
  2. Verify that the node is removed from the cluster and RabbitMQ is stopped on this node:

    rabbitmqctl cluster_status
    

    Example of system response:

    Cluster status of node rabbit@msg01
    [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
    {running_nodes,[rabbit@msg03,rabbit@msg01]},  # <<< rabbit stopped on msg02
    {cluster_name,<<"openstack">>},
    {partitions,[]},
    {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]}]}]
    
  3. Start rabbitmq-server:

    systemctl start rabbitmq-server
    

Restart a RabbitMQ cluster

To restart the whole RabbitMQ cluster:

  1. Stop RabbitMQ on nodes one by one:

    salt msg01* cmd.run 'systemctl stop rabbitmq-server'
    salt msg02* cmd.run 'systemctl stop rabbitmq-server'
    salt msg03* cmd.run 'systemctl stop rabbitmq-server'
    
  2. Restart RabbitMQ in the reverse order:

    salt msg03* cmd.run 'systemctl start rabbitmq-server'
    salt msg02* cmd.run 'systemctl start rabbitmq-server'
    salt msg01* cmd.run 'systemctl start rabbitmq-server'
    

Restart RabbitMQ with clearing the Mnesia database

To restart RabbitMQ with clearing the Mnesia database:

  1. Stop RabbitMQ on nodes one by one:

    salt msg01* cmd.run 'systemctl stop rabbitmq-server'
    salt msg02* cmd.run 'systemctl stop rabbitmq-server'
    salt msg03* cmd.run 'systemctl stop rabbitmq-server'
    
  2. Remove the Mnesia database on all nodes:

    salt msg0* cmd.run 'rm -rf /var/lib/rabbitmq/mnesia/'
    
  3. Apply the rabbitmq state on the first RabbitMQ node:

    salt msg01* state.apply rabbitmq
    
  4. Apply the rabbitmq state on the remaining RabbitMQ nodes:

    salt -C "msg02* or msg03*" state.apply rabbitmq
    

Switch to nonclustered RabbitMQ

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Note

This feature is available for OpenStack Queens and RabbitMQ 3.8.2 or later.

You can switch clustered RabbitMQ to a nonclustered configuration. Mirantis recommends using such an approach only to improve the stability and performance of large deployments if the clustered configuration causes issues.

Switch to nonclustered RabbitMQ

This section instructs you on how to switch clustered RabbitMQ to a nonclustered configuration.

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Caution

  • This feature is available for OpenStack Queens and RabbitMQ 3.8.2 or later.
  • The procedure below applies only to environments without manual changes in the configuration files of OpenStack services. The procedure applies all OpenStack states to all OpenStack nodes and implies that every state can be applied without errors before starting the maintenance.

To switch RabbitMQ to a nonclustered configuration:

  1. Perform the following prerequisite steps:

    1. Log in to the Salt Master node.

    2. Verify that the salt-formula-nova version is 2016.12.1+202101271624.d392d41~xenial1 or newer:

      dpkg -l |grep salt-formula-nova
      
    3. Verify that the salt-formula-oslo-templates version is 2018.1+202101191343.e24fd64~xenial1 or newer.

      dpkg -l |grep salt-formula-oslo-templates
      
    4. Create /root/non-clustered-rabbit-helpers.sh with the following content:

      #!/bin/bash
      
      # Apply all known openstack states on given target
      # example: run_openstack_states ctl*
      function run_openstack_states {
        local target="$1"
        all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
        #List of nodes in cloud
        list_nodes=`salt -C "$target" test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
        for node in $list_nodes; do
          #List of applications on the given node
          node_applications=$(salt $node pillar.items __reclass__:applications --out=json | jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
          for component in $all_formulas ; do
            if [[ " ${node_applications[*]} " == *"$component"* ]]; then
              echo "Applying state: $component on the $node"
              salt $node state.apply $component
            fi
          done
        done
      }
      
      # Apply specified update state for all OpenStack applications on given target
      # example: run_openstack_update_states ctl0* upgrade.verify
      # will run {nova|glance|cinder|keystone}.upgrade.verify on ctl01
      function run_openstack_update_states {
        local target="$1"
        local state="$2"
        all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
        #List of nodes in cloud
        list_nodes=`salt -C "$target" test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
        for node in $list_nodes; do
          #List of applications on the given node
          node_applications=$(salt $node pillar.items __reclass__:applications --out=json | jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
          for component in $all_formulas ; do
            if [[ " ${node_applications[*]} " == *"$component"* ]]; then
              echo "Applying state: $component.${state} on the $node"
              salt $node state.apply $component.${state}
            fi
          done
        done
      }
      
    5. Run simple API checks for ctl01*. The output should not include errors.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl01* upgrade.verify
      
    6. Open your project Git repository with the Reclass model on the cluster level.

    7. Prepare the Neutron server for the RabbitMQ reconfiguration:

      1. In openstack/control.yml, specify the allow_automatic_dhcp_failover parameter as required.

        Caution

        If set to true, the server reschedules the networks from the failed DHCP agents so that the alive agents pick them up and serve DHCP. Once a failed agent reconnects to RabbitMQ, it detects that its networks have been rescheduled and removes the DHCP port, namespace, and flows. This parameter is useful if an entire gateway node goes down. If only RabbitMQ is unstable, the agents do not go down and the data plane is not affected. Therefore, we recommend that you set the allow_automatic_dhcp_failover parameter to false. However, consider the risk of a gateway node going down before setting this parameter.

        neutron:
          server:
            allow_automatic_dhcp_failover: false
        
      2. Apply the changes:

        salt -C 'I@neutron:server' state.apply neutron.server
        
      3. Verify the changes:

        salt -C 'I@neutron:server' cmd.run "grep allow_automatic_dhcp_failover /etc/neutron/neutron.conf"
        
  2. Perform the following changes in the Reclass model on the cluster level:

    1. In infra/init.yml, add the following variable:

      parameters:
        _param:
          openstack_rabbitmq_standalone_mode: true
      
    2. In openstack/message_queue.yml, comment the following class:

      classes:
      #- system.rabbitmq.server.cluster
      
    3. In openstack/message_queue.yml, add the following classes:

      classes:
      - system.keepalived.cluster.instance.rabbitmq_vip
      - system.rabbitmq.server.single
      
    4. If your deployment has OpenContrail, add the following variables:

      1. In opencontrail/analytics.yml, add:

        parameters:
          opencontrail:
            collector:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
        
      2. In opencontrail/control.yml, add:

        parameters:
          opencontrail:
            config:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
            control:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
        
    5. To update the cells database when running Nova states, add the following variable to openstack/control.yml:

      parameters:
        nova:
          controller:
            update_cells: true
      
    6. Refresh pillars on all nodes:

      salt '*' saltutil.sync_all; salt '*' saltutil.refresh_pillar
      
  3. Verify that the messaging variables are set correctly:

    Note

    The following validation highlights the output for core OpenStack services only. Validate any additional deployed services appropriately.

    1. For Keystone:

      salt -C 'I@keystone:server' pillar.items keystone:server:message_queue:use_vip_address keystone:server:message_queue:host
      
    2. For Heat:

      salt -C 'I@heat:server' pillar.items heat:server:message_queue:use_vip_address heat:server:message_queue:host
      
    3. For Cinder:

      salt -C 'I@cinder:controller' pillar.items cinder:controller:message_queue:use_vip_address cinder:controller:message_queue:host
      
    4. For Glance:

      salt -C 'I@glance:server' pillar.items glance:server:message_queue:use_vip_address glance:server:message_queue:host
      
    5. For Nova:

      salt -C 'I@nova:controller' pillar.items nova:controller:message_queue:use_vip_address nova:controller:message_queue:host
      
    6. For the OpenStack compute nodes:

      salt -C 'I@nova:compute' pillar.items nova:compute:message_queue:use_vip_address nova:compute:message_queue:host
      
    7. For Neutron:

      salt -C 'I@neutron:server' pillar.items neutron:server:message_queue:use_vip_address neutron:server:message_queue:host
      salt -C 'I@neutron:gateway' pillar.items neutron:gateway:message_queue:use_vip_address neutron:gateway:message_queue:host
      
    8. If your deployment has OpenContrail:

      salt 'ntw01*' pillar.items opencontrail:config:message_queue:members opencontrail:control:message_queue:members
      salt 'nal01*' pillar.items opencontrail:collector:message_queue:members
      
  4. Apply the changes:

    1. Stop the OpenStack control plane services on the ctl nodes:

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl* upgrade.service_stopped
      
    2. Stop the OpenStack services on the gtw nodes. Skip this step if your deployment has OpenContrail or does not have gtw nodes.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states gtw* upgrade.service_stopped
      
    3. Reconfigure the Keepalived and RabbitMQ clusters on the msg nodes:

      1. Verify that the rabbitmq:cluster pillars are not present:

        salt -C 'I@rabbitmq:server' pillar.items rabbitmq:cluster
        
      2. Verify that the haproxy pillars are not present:

        salt -C 'I@rabbitmq:server' pillar.item haproxy
        
      3. Remove HAProxy, HAProxy monitoring, and reconfigure Keepalived:

        salt -C 'I@rabbitmq:server' cmd.run "export DEBIAN_FRONTEND=noninteractive; apt purge haproxy -y"
        salt -C 'I@rabbitmq:server' state.apply telegraf
        salt -C 'I@rabbitmq:server' state.apply keepalived
        
      4. Verify that a VIP address is present on one of the msg nodes:

        OPENSTACK_MSG_Q_ADDRESS=$(salt msg01* pillar.items _param:openstack_message_queue_address --out json|jq '.[][]')
        salt -C 'I@rabbitmq:server' cmd.run "ip addr |grep $OPENSTACK_MSG_Q_ADDRESS"
        
      5. Stop the RabbitMQ server, clear mnesia, and reconfigure rabbitmq-server:

        salt -C 'I@rabbitmq:server' cmd.run 'systemctl stop rabbitmq-server'
        salt -C 'I@rabbitmq:server' cmd.run 'rm -rf /var/lib/rabbitmq/mnesia/'
        salt -C 'I@rabbitmq:server' state.apply rabbitmq
        
      6. Verify that the RabbitMQ server is running in a nonclustered configuration:

        salt -C 'I@rabbitmq:server' cmd.run "rabbitmqctl --formatter=erlang cluster_status |grep running_nodes"
        

        Example of system response:

        msg01.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg01]},
        msg03.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg03]},
        msg02.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg02]},
        
    4. Reconfigure OpenStack services on the ctl nodes:

      1. Apply all OpenStack states on ctl nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states ctl*
        
      2. Verify transport_url for the OpenStack services on the ctl nodes:

        salt 'ctl*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify that the cells database is updated and transport_url has a VIP address:

        salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; nova-manage cell_v2 list_cells"
        
    5. Reconfigure RabbitMQ on the gtw nodes. Skip this step if your deployment has OpenContrail or does not have gtw nodes.

      1. Apply all OpenStack states on the gtw nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states gtw*
        
      2. Verify transport_url for the OpenStack services on the gtw nodes:

        salt 'gtw*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify that the agents are up:

        salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack network agent list"
        
    6. If your deployment has OpenContrail, reconfigure RabbitMQ on the ntw and nal nodes:

      1. Apply the following state on the ntw and nal nodes:

        salt -C 'ntw* or nal*' state.apply opencontrail
        
      2. Verify transport_url for the OpenStack services on the ntw and nal nodes:

        salt -C 'ntw* or nal*' cmd.run "for s in contrail; do if [[ -d "/etc/\$s" ]]; then grep ^rabbitmq_server_list /etc/\$s/*.conf; fi; done" shell=/bin/bash
        salt 'ntw*' cmd.run "for s in contrail; do if [[ -d "/etc/\$s" ]]; then grep ^rabbit_server /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify the OpenContrail status:

        salt -C 'ntw* or nal*' cmd.run 'doctrail all contrail-status'
        
    7. Reconfigure OpenStack services on the cmp nodes:

      1. Apply all OpenStack states on the cmp nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states cmp*
        
      2. Verify transport_url for the OpenStack services on the cmp nodes:

        salt 'cmp*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        

    Caution

    If your deployment has other nodes with OpenStack services, apply the changes on such nodes as well using the required states.

  5. Verify the services:

    1. Verify that the Neutron services are up. Skip this step if your deployment has OpenContrail.

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack network agent list"
      
    2. Verify that the Nova services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack compute service list"
      
    3. Verify that Heat services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack orchestration service list"
      
    4. Verify that the Cinder services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack volume service list"
      
    5. From the ctl01* node, apply the <app>.upgrade.verify state. The output should not include errors.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl01* upgrade.verify
      
  6. Perform post-configuration steps:

    1. Disable the RabbitMQUnequalQueueCritical Prometheus alert:

      1. In stacklight/server.yml, add the following variable:

        parameters:
          prometheus:
            server:
              alert:
                RabbitMQUnequalQueueCritical:
                  enabled: false
        
      2. Apply the Prometheus state to the mon nodes:

        salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
        
    2. Revert the changes in the Reclass model on the cluster level:

      1. In openstack/control.yml, set allow_automatic_dhcp_failover back to true or leave it as is if you did not change the value.
      2. In openstack/control.yml, remove the nova:controller:update_cells: true parameter.
    3. Apply the Neutron state:

      salt -C 'I@neutron:server' state.apply neutron.server
      
    4. Verify the changes:

      salt -C 'I@neutron:server' cmd.run "grep allow_automatic_dhcp_failover /etc/neutron/neutron.conf"
      
    5. Remove the script:

      rm -f /root/non-clustered-rabbit-helpers.sh
      
Roll back to clustered RabbitMQ

This section instructs you on how to roll back RabbitMQ to a clustered configuration after switching it to a nonclustered configuration as described in Switch to nonclustered RabbitMQ.

Note

After performing the rollback procedure, you may notice a number of down heat-engine instances of a previous version among the running heat-engine instances. Such behavior looks abnormal but is expected. Verify the Updated At field of the running heat-engine instances and ignore the stopped (down) ones.
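
For example, to review the Updated At field after the rollback, reuse the orchestration service check from the switch procedure:

    salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack orchestration service list"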

To roll back RabbitMQ to a clustered configuration:

  1. If you have removed the non-clustered-rabbit-helpers.sh script, create it again as described in Switch to nonclustered RabbitMQ.

  2. Revert the changes performed in the cluster model in step 2 of Switch to nonclustered RabbitMQ. For example, use git stash if you did not commit the changes.

  3. From the Salt Master node, refresh pillars on all nodes:

    salt '*' saltutil.sync_all; salt '*' saltutil.refresh_pillar
    
  4. Roll back the changes on the RabbitMQ nodes:

    salt -C 'I@rabbitmq:server' cmd.run 'systemctl stop rabbitmq-server'
    salt -C 'I@rabbitmq:server' cmd.run 'rm -rf /var/lib/rabbitmq/mnesia/'
    
    salt -C 'I@rabbitmq:server' state.apply keepalived
    salt -C 'I@rabbitmq:server' state.apply haproxy
    salt -C 'I@rabbitmq:server' state.apply telegraf
    salt -C 'I@rabbitmq:server' state.apply rabbitmq
    
  5. Verify that the RabbitMQ server is running in a clustered configuration:

    salt -C 'I@rabbitmq:server' cmd.run "rabbitmqctl --formatter=erlang cluster_status |grep running_nodes"
    

    Example of system response:

    msg01.heat-cicd-queens-dvr-sl.local:
         {running_nodes,[rabbit@msg02,rabbit@msg03,rabbit@msg01]},
    msg02.heat-cicd-queens-dvr-sl.local:
         {running_nodes,[rabbit@msg01,rabbit@msg03,rabbit@msg02]},
    msg03.heat-cicd-queens-dvr-sl.local:
         {running_nodes,[rabbit@msg02,rabbit@msg01,rabbit@msg03]},
    
  6. Roll back the changes on other nodes:

    1. Roll back the changes on the ctl nodes:

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_states ctl*
      
    2. Roll back changes on the gtw nodes. Skip this step if your deployment has OpenContrail or does not have gtw nodes.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_states gtw*
      
  7. If your environment has OpenContrail, roll back the changes on the ntw and nal nodes:

    salt -C 'ntw* or nal*' state.apply opencontrail
    
  8. Roll back the changes on the cmp nodes:

    . /root/non-clustered-rabbit-helpers.sh
    run_openstack_states cmp*
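
Optionally, after the rollback, rerun the service checks from step 5 of the switch procedure to confirm that the OpenStack services have reconnected to the clustered RabbitMQ, for example:

    salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack compute service list"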
    

Enable queue mirroring

Note

This feature is available starting from the MCP 2019.2.15 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Mirroring policy enables RabbitMQ to mirror the queue contents to an additional RabbitMQ node in the RabbitMQ cluster. This approach reduces failures during the RabbitMQ cluster recovery.

Warning

  • This feature is of use only for clustered RabbitMQ configurations.
  • Enabling mirroring for queues and exchanges in RabbitMQ may increase the message passing latency and prolong the RabbitMQ cluster recovery after a network partition. Therefore, we recommend performing the procedure on a staging environment before applying it to production.

To enable queue mirroring:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. In <cluster_name>/openstack/message_queue.yml, specify ha_exactly_ttl_120 in classes:

    classes:
    ...
    - system.rabbitmq.server.vhost.openstack
    - system.rabbitmq.server.vhost.openstack.without_rpc_ha
    - system.rabbitmq.server.vhost.openstack.ha_exactly_ttl_120
    ...
    
  3. Log in to the Salt Master node.

  4. Apply the rabbitmq.server state:

    salt -C 'I@rabbitmq:server and *01*' state.sls rabbitmq.server
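
After the state is applied, you can optionally confirm that a mirroring policy is present on the OpenStack vhost. The /openstack vhost name below is an assumption based on the vhost classes in the example above; adjust it to match your deployment:

    salt -C 'I@rabbitmq:server and *01*' cmd.run "rabbitmqctl list_policies -p /openstack"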
    

Randomize RabbitMQ reconnection intervals

Note

This feature is available starting from the MCP 2019.2.15 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can randomize RabbitMQ reconnection intervals (or timeouts) for the required OpenStack services. It is helpful for large OpenStack environments where a simultaneous reconnection of all OpenStack services after a RabbitMQ cluster partitioning can significantly prolong the RabbitMQ cluster recovery or cause the cluster to enter the split-brain mode.

When this feature is enabled, the following OpenStack configuration options are randomized within these ranges:

  • kombu_reconnect_delay - from 30 to 60 seconds
  • rabbit_retry_interval - from 10 to 60 seconds
  • rabbit_retry_backoff - from 30 to 60 seconds
  • rabbit_interval_max - from 60 to 180 seconds
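
For illustration, after you complete the procedure below for the Compute service, you can spot-check the rendered values on the compute nodes. This check assumes that the options end up in /etc/nova/nova.conf:

    salt -C 'I@nova:compute' cmd.run "grep -E 'kombu_reconnect_delay|rabbit_retry_interval|rabbit_retry_backoff|rabbit_interval_max' /etc/nova/nova.conf"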

To randomize RabbitMQ reconnection intervals:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Open the configuration file of the required OpenStack service. For example, for the OpenStack Compute service (Nova), open <cluster_name>/openstack/compute/init.yml.

  3. Under message_queue, specify rabbit_timeouts_random: True:

    parameters:
      nova:
        compute:
          message_queue:
            rabbit_timeouts_random: True
    
  4. Log in to the Salt Master node.

  5. Apply the corresponding OpenStack service state(s). For example, for the OpenStack Compute service (Nova), apply the following state:

    salt -C 'I@nova:compute' state.sls nova.compute
    

    Note

    Each service configured with this feature on every node will receive new unique timeouts on every run of the corresponding OpenStack service Salt state.

  6. Perform steps 2-5 for other OpenStack services as required.

Remove a node

Removing a node from a Salt-managed environment comes down to disabling the salt-minion service on the node, removing its key from the Salt Master node, and updating the services so that they no longer expect the node to be available.

To remove a node:

  1. Stop and disable the salt-minion service on the node you want to remove:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  2. Verify that the name of the node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep <nodename><NUM>
    salt-key -d <nodename><NUM>.domain_name
    
  3. Update your Reclass metadata model to remove the node from services. Apply the necessary Salt states. This step is generic as different services can be involved depending on the node being removed.
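
    For example, if the removed node hosted an OpenStack compute role (a hypothetical cmp node), you could regenerate the node definitions on the Salt Master node and verify that the node no longer appears among the registered compute services. The exact states to apply depend on the roles the removed node carried:

      salt-call state.sls reclass.storage -l debug
      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack compute service list"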

Manage certificates

After you deploy an MCP cluster, you can renew your expired certificates or replace them with the endpoint certificates provided by a customer, as required. When you renew a certificate, its key remains the same. When you replace a certificate, a new certificate key is added accordingly.

You can either push certificates from pillars or regenerate them as follows:

  • Generate and update by salt-minion (signed by salt-master)
  • Generate and update by external certificate authorities, for example, by Let’s Encrypt

Certificates generated by salt-minion can be renewed by the salt.minion state. The renewal operation becomes available within 30 days before the expiration date. This is controlled by the days_remaining parameter of the x509.certificate_managed Salt state. Refer to Salt.states.x509 for details.

You can force the renewal of certificates by removing the old certificates and running the salt.minion.cert state on each target node.
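
For example, a minimal forced renewal for a single node could look as follows. The node target and certificate path below are illustrative only; the actual path depends on the service, see the per-service sections below:

    salt 'ctl01*' cmd.run 'rm -f /etc/ssl/certs/internal_proxy.crt'
    salt 'ctl01*' state.sls salt.minion.cert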

Publish CA certificates

If you use certificates issued by Certificate Authorities that are not recognized by the operating system, you must publish them.

To publish CA certificates:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Create the /infra/ssl/init.yml file with the following configuration as an example:

    parameters:
      linux:
        system:
          ca_certificates:
            ca-salt_master_ca: |
              -----BEGIN CERTIFICATE-----
              MIIGXzCCBEegAwIBAgIDEUB0MA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
              ...
              YqQO
              -----END CERTIFICATE-----
            ca-salt_master_ca_old: |
              -----BEGIN CERTIFICATE-----
              MIIFgDCCA2igAwIBAgIDET0sMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
              ...
              WzUuf8H9dBW2DPtk5Jq/+QWtYMs=
              -----END CERTIFICATE-----
    
  3. To publish the certificates on all nodes managed by Salt, update /infra/init.yml by adding the newly created class:

    classes:
    - cluster.<cluster_name>.infra.ssl
    
  4. To publish the certificates on a specific node, update /infra/config.yml. For example:

    parameters:
      reclass:
        storage:
          node:
            openstack_control_node01:
              classes:
              - cluster.${_param:cluster_name}.openstack.ssl
    
  5. Log in to the Salt Master node.

  6. Update the Reclass storage:

    salt-call state.sls reclass.storage -l debug
    
  7. Apply the linux.system.certificate state on all nodes:

    salt \* state.sls linux.system.certificate -l debug
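
Optionally, verify that the published CA certificates are present on the nodes. The file name below corresponds to the pillar key from the example above (ca-salt_master_ca):

    salt \* cmd.run 'ls /usr/local/share/ca-certificates/ | grep ca-salt_master_ca'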
    

NGINX certificates

This section describes how to renew or replace the NGINX certificates that are either managed by salt-minion or self-managed through pillars. In both cases, you must verify the GlusterFS share salt_pki before the renewal.

Verify the GlusterFS share salt_pki

Before you proceed with the NGINX certificates renewal or replacement, verify the GlusterFS share salt_pki.

To verify the GlusterFS share salt_pki:

  1. Log in to any infrastructure node that hosts the salt_pki GlusterFS volume.

  2. Obtain the list of the GlusterFS minion IDs:

    salt -C 'I@glusterfs:server' test.ping --output yaml | cut -d':' -f1
    

    Example of system response:

    kvm01.multinode-ha.int
    kvm03.multinode-ha.int
    kvm02.multinode-ha.int
    
  3. Verify that the volume is replicated and is online for any of the minion IDs from the list obtained in the previous step:

    salt <minion_id> cmd.run 'gluster volume status salt_pki'
    

    Example of system response:

    Status of volume: salt_pki
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 192.168.2.241:/srv/glusterfs/salt_pki 49154     0          Y       9211
    Brick 192.168.2.242:/srv/glusterfs/salt_pki 49154     0          Y       8499
    Brick 192.168.2.243:/srv/glusterfs/salt_pki 49154     0          Y       8332
    Self-heal Daemon on localhost               N/A       N/A        Y       6313
    Self-heal Daemon on 192.168.2.242           N/A       N/A        Y       10203
    Self-heal Daemon on 192.168.2.243           N/A       N/A        Y       2068
    
    Task Status of Volume salt_pki
    ------------------------------------------------------------------------------
    There are no active volume tasks
    
  4. Log in to the Salt Master node.

  5. Verify that the salt_pki volume is mounted on each proxy node and the Salt Master node:

    salt -C 'I@nginx:server:site:*:host:protocol:https or I@salt:master' \
    cmd.run 'mount | grep salt_pki'
    

    Example of system response:

    prx01.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    prx02.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    cfg01.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    
  6. Proceed with the renewal or replacement of the NGINX certificates as required.
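
As an additional optional check, you can verify that the salt_pki volume has no pending self-heal entries, where <minion_id> is any minion from the list obtained in step 2:

    salt <minion_id> cmd.run 'gluster volume heal salt_pki info'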

Renew or replace the NGINX certificates managed by salt-minion

This section describes how to renew or replace the NGINX certificates managed by salt-minion.

To renew or replace the NGINX certificates managed by salt-minion:

  1. Complete the steps described in Verify the GlusterFS share salt_pki.

  2. Log in to the Salt Master node.

  3. Verify the certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  4. Remove your current certificates from the Salt Master node.

    Note

    The following command also removes certificates from all proxy nodes as they use the same GlusterFS share.

    rm -f /srv/salt/pki/*/*.[pemcrt]*
    
  5. If you replace the certificates, remove the private key:

    rm -f /srv/salt/pki/*/proxy.key
    
  6. Renew or replace your certificates by applying the salt.minion.cert state on all proxy nodes one by one:

    salt -C 'I@nginx:server:site:*:host:protocol:https' state.sls salt.minion.cert -b 1
    
  7. Apply the nginx state on all proxy nodes one by one:

    salt -C 'I@nginx:server:site:*:host:protocol:https' state.sls nginx -b 1
    
  8. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
Renew the self-managed NGINX certificates

This section describes how to renew the self-managed NGINX certificates.

To renew the self-managed NGINX certificates:

  1. Complete the steps described in Verify the GlusterFS share salt_pki.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. Update the /openstack/proxy.yml file with the following configuration as an example:

    parameters:
      _param:
        nginx_proxy_ssl:
          enabled: true
          mode: secure
          key_file:  /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT.key
          cert_file: /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT.crt
          chain_file: /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT_CHAIN.crt
          key: |
            -----BEGIN PRIVATE KEY-----
            MIIJRAIBADANBgkqhkiG9w0BAQEFAASCCS4wggkqAgEAAoICAQC3qXiZiugf6HlR
            ...
            aXK0Fg1hJKu60Oh+E5H1d+ZVbP30xpdQ
            -----END PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIHDzCCBPegAwIBAgIDLYclMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            lHfjP1c6iWAL0YEp1IMCeM01l4WWj0ymb7f4wgOzcULfwzU=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN CERTIFICATE-----
            MIIFgDCCA2igAwIBAgIDET0sMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            UPwFzYIVkwy4ny+UJm9js8iynKro643mXty9vj5TdN1iK3ZA4f4/7kenuHtGBNur
            WzUuf8H9dBW2DPtk5Jq/+QWtYMs=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIGXzCCBEegAwIBAgIDEUB0MA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            /inxvBr89TvbCP2hweGMD6w1mKJU2SWEQwMs7P72dU7VuVqyyoutMWakJZ+xoGE9
            YqQO
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIHDzCCBPegAwIBAgIDLYclMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            lHfjP1c6iWAL0YEp1IMCeM01l4WWj0ymb7f4wgOzcULfwzU=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.
    • If you replace the certificates, modify all three sections.

    Note

    The key, cert, and chain sections are optional. You can select from the following options:

    • Store certificates in the file system in /srv/salt/pki/**/ and add the key_file, cert_file, and chain_file lines to /openstack/proxy.yml.
    • Add only the key, cert, and chain sections without the key_file, cert_file, and chain_file lines to /openstack/proxy.yml. The certificates are stored under the /etc directory as default paths in the Salt formula.
    • Use all three sections, as in the example above. All content is available in the pillar and is also stored in /srv/salt/pki/**. This option requires manually adding the content of the certificate and key files to the .yml files.
  4. Log in to the Salt Master node.

  5. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  6. Remove the current certificates.

    Note

    The following command also removes certificates from all proxy nodes as they use the same GlusterFS share.

    rm -f /srv/salt/pki/*/*.[pemcrt]*
    
  7. If you replace the certificates, remove the private key:

    rm -f /srv/salt/pki/*/proxy.key
    
  8. Apply the nginx state on all proxy nodes one by one:

    salt -C 'I@nginx:server' state.sls nginx -b 1
    
  9. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  10. Restart the NGINX services and remove the VIP before restart:

    salt -C 'I@nginx:server' cmd.run 'service keepalived stop; sleep 5; \
    service nginx restart; service keepalived start' -b 1
    

HAProxy certificates

This section describes how to renew or replace the HAProxy certificates that are either managed by salt-minion or self-managed through pillars.

Renew or replace the HAProxy certificates managed by salt-minion

This section describes how to renew or replace the HAProxy certificates managed by salt-minion.

To renew or replace the HAProxy certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Obtain the list of the HAProxy minion IDs where the certificate should be replaced:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1
    

    Example of system response:

    cid02.multinode-ha.int
    cid03.multinode-ha.int
    cid01.multinode-ha.int
    
  3. Verify the certificate validity date for each HAProxy minion listed in the output of the above command:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | \
    cut -d"'" -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} \
    cmd.run "openssl x509 -in ${c} -text | egrep -i 'after|before'"; done; done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    
  4. Remove your current certificates from each HAProxy minion:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | sed s/-all.pem/.crt/ | tr '\n' ' '); \
    do salt -C ${m} cmd.run "rm -f ${c}"; done; done; \
    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} cmd.run "rm -f ${c}"; done; done; \
    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    cmd.run 'rm -f /etc/haproxy/ssl/salt_master_ca-ca.crt'
    
  5. If you replace the certificates, remove the private key:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | sed s/-all.pem/.key/ | tr '\n' ' '); \
    do salt -C ${m} cmd.run "rm -f ${c}"; done; done;
    
  6. Apply the salt.minion.grains state for all HAProxy nodes to retrieve the CA certificate from Salt Master:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' state.sls salt.minion.grains
    
  7. Apply the salt.minion.cert state for all HAProxy nodes:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' state.sls salt.minion.cert
    
  8. Verify the new certificate validity date:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} \
    cmd.run "openssl x509 -in ${c} -text | egrep -i 'after|before'"; done; done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    
  9. Restart the HAProxy services on each HAProxy minion and remove the VIP before restart:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    cmd.run 'service keepalived stop; sleep 5; \
    service haproxy stop; service haproxy start; service keepalived start' -b 1
    
Renew or replace the self-managed HAProxy certificates

This section describes how to renew or replace the self-managed HAProxy certificates.

To renew or replace the self-managed HAProxy certificates:

  1. Log in to the Salt Master node.

  2. Verify the certificate validity date:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '..
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in ${name} -text -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Note

    In the command above, the pem_file value is used to specify the explicit certificate path.

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 25 15:32:17 2018 GMT
                    Not After : May 25 15:32:17 2019 GMT
    cid01.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid03.multinode-ha.int:
                    Not Before: May 25 15:21:17 2018 GMT
                    Not After : May 25 15:21:17 2019 GMT
    
  3. Open your project Git repository with the Reclass model on the cluster level.

  4. For each class file with the HAProxy class enabled, update its pillar values with the following configuration as an example:

    parameters:
      _param:
        haproxy_proxy_ssl:
          enabled: true
          mode: secure
          key: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
            ...
            M8IfJ5I=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.
    • If you replace the certificates, modify all three sections.
  5. Remove your current certificates from the HAProxy nodes:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '..
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "rm -f ${name}";
      done;
    done;
    
  6. Apply the haproxy.proxy state on all HAProxy nodes one by one:

    salt -C 'I@haproxy:proxy' state.sls haproxy.proxy -b 1
    
  7. Verify the new certificate validity date:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '..
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in ${name} -text -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid03.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid01.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    
  8. Restart the HAProxy services one by one and remove the VIP before restart:

    salt -C 'I@haproxy:proxy' cmd.run 'service keepalived stop; sleep 5; \
    service haproxy stop; service haproxy start; service keepalived start' -b 1
    

Apache certificates

This section describes how to renew or replace the Apache certificates that are either managed by salt-minion or self-managed through pillars.

Renew or replace the Apache certificates managed by salt-minion

This section describes how to renew or replace the Apache certificates managed by salt-minion.

Warning

If you replace or renew the Apache certificates after the Salt Master CA certificate has been replaced, make sure that both new and old CA certificates are published as described in Publish CA certificates.

To renew or replace the Apache certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Verify your current certificate validity date:

    salt -C 'I@apache:server' cmd.run 'openssl x509 \
    -in /etc/ssl/certs/internal_proxy.crt -text -noout | grep -Ei "after|before"'
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: May 29 12:58:25 2018 GMT
                    Not After : May 29 12:58:25 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Apr 27 12:37:28 2018 GMT
                    Not After : Apr 27 12:37:28 2019 GMT
    
  3. Remove your current certificates from the Apache nodes:

    salt -C 'I@apache:server' cmd.run 'rm -f /etc/ssl/certs/internal_proxy.crt'
    
  4. If you replace the certificates, remove the private key:

    salt -C 'I@apache:server' cmd.run 'rm -f /etc/ssl/private/internal_proxy.key'
    
  5. Renew or replace your certificates by applying the salt.minion.cert state on all Apache nodes one by one:

    salt -C 'I@apache:server' state.sls salt.minion.cert
    
  6. Refresh the CA chain:

    salt -C 'I@apache:server' cmd.run 'cat /etc/ssl/certs/internal_proxy.crt \
    /usr/local/share/ca-certificates/ca-salt_master_ca.crt > \
    /etc/ssl/certs/internal_proxy-with-chain.crt; \
    chmod 0644 /etc/ssl/certs/internal_proxy-with-chain.crt; \
    chown root:root /etc/ssl/certs/internal_proxy-with-chain.crt'
    
  7. Verify the new certificate validity date:

    salt -C 'I@apache:server' cmd.run 'openssl x509 \
    -in /etc/ssl/certs/internal_proxy.crt -text -noout | grep -Ei "after|before"'
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: Jun  6 17:24:42 2018 GMT
                    Not After : Jun  6 17:24:42 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Jun  6 17:23:38 2018 GMT
                    Not After : Jun  6 17:23:38 2019 GMT
    
  8. Restart the Apache services one by one:

    salt -C 'I@apache:server' cmd.run 'service apache2 stop; service apache2 start; sleep 60' -b1
    
Replace the self-managed Apache certificates

This section describes how to replace the self-managed Apache certificates.

Warning

If you replace or renew the Apache certificates after the Salt Master CA certificate has been replaced, make sure that both new and old CA certificates are published as described in Publish CA certificates.

To replace the self-managed Apache certificates:

  1. Log in to the Salt Master node.

  2. Verify your current certificate validity date:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in /etc/ssl/certs/${name}.crt -text \
        -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: May 29 12:58:25 2018 GMT
                    Not After : May 29 12:58:25 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Apr 27 12:37:28 2018 GMT
                    Not After : Apr 27 12:37:28 2019 GMT
    
  3. Open your project Git repository with the Reclass model on the cluster level.

  4. For each class file with the Apache server class enabled, update the _param:apache_proxy_ssl value with the following configuration as an example:

    parameters:
      _param:
        apache_proxy_ssl:
          enabled: true
          mode: secure
          key: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
            ...
            M8IfJ5I=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.
    • If you replace the certificates, modify all three sections.
  5. Remove your current certificates from the Apache nodes:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "rm -f /etc/ssl/certs/${name}.crt";
      done;
    done;
    
  6. Apply the apache.server state on all Apache nodes one by one:

    salt -C 'I@apache:server' state.sls apache.server
    
  7. Verify the new certificate validity date:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in /etc/ssl/certs/${name}.crt -text \
        -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: Jun  6 17:24:42 2018 GMT
                    Not After : Jun  6 17:24:42 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Jun  6 17:23:38 2018 GMT
                    Not After : Jun  6 17:23:38 2019 GMT
    
  8. Restart the Apache services one by one:

    salt -C 'I@apache:server' cmd.run 'service apache2 stop; service apache2 start' -b 1
    

RabbitMQ certificates

This section describes how to renew or replace the RabbitMQ cluster certificates that are either managed by salt-minion or self-managed through pillars.

Verify that the RabbitMQ cluster uses certificates

This section describes how to determine whether your RabbitMQ cluster uses certificates and identify their location on the system.

To verify that the RabbitMQ cluster uses certificates:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@rabbitmq:server' cmd.run "rabbitmqctl environment | \
    grep -E '/ssl/|ssl_listener|protocol_version'"
    

    Example of system response:

    msg02.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
    msg01.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
    msg03.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
    
  3. Proceed to renewal or replacement of your certificates as required.
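
As an additional optional check, you can confirm that the TLS listener responds and inspect the expiry date of the certificate it serves. The 5671 port is taken from the example output above; 127.0.0.1 assumes the listener binds to all interfaces as shown (0.0.0.0):

    salt -C 'I@rabbitmq:server' cmd.run 'echo | openssl s_client -connect 127.0.0.1:5671 2>/dev/null | openssl x509 -noout -enddate'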

Renew or replace the RabbitMQ certificates managed by salt-minion

This section describes how to renew or replace the RabbitMQ certificates managed by salt-minion.

To renew or replace the RabbitMQ certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Verify the certificates validity dates:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Apr 27 12:37:14 2018 GMT
    Not After : Apr 27 12:37:14 2019 GMT
    Not Before: Apr 27 12:37:08 2018 GMT
    Not After : Apr 27 12:37:08 2019 GMT
    Not Before: Apr 27 12:37:13 2018 GMT
    Not After : Apr 27 12:37:13 2019 GMT
    
  3. Remove the certificates from the RabbitMQ nodes:

    salt -C 'I@rabbitmq:server' cmd.run 'rm -f /etc/rabbitmq/ssl/cert.pem'
    
  4. If you replace the certificates, remove the private key:

    salt -C 'I@rabbitmq:server' cmd.run 'rm -f /etc/rabbitmq/ssl/key.pem'
    
  5. Regenerate the certificates on the RabbitMQ nodes:

    salt -C 'I@rabbitmq:server' state.sls salt.minion.cert
    
  6. Verify that the certificates validity dates have changed:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 23:52:40 2018 GMT
    Not After : Jun  4 23:52:40 2019 GMT
    Not Before: Jun  4 23:52:41 2018 GMT
    Not After : Jun  4 23:52:41 2019 GMT
    Not Before: Jun  4 23:52:41 2018 GMT
    Not After : Jun  4 23:52:41 2019 GMT
    
  7. Restart the RabbitMQ services one by one:

    salt -C 'I@rabbitmq:server' cmd.run 'service rabbitmq-server stop; \
    service rabbitmq-server start' -b1
    
  8. Verify the RabbitMQ cluster status:

    salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
    

    Example of system response:

    msg03.multinode-ha.int:
        Cluster status of node rabbit@msg03
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg01,rabbit@msg02,rabbit@msg03]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg01,[]},{rabbit@msg02,[]},{rabbit@msg03,[]}]}]
    msg01.multinode-ha.int:
        Cluster status of node rabbit@msg01
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg02,rabbit@msg01]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg02,[]},{rabbit@msg01,[]}]}]
    msg02.multinode-ha.int:
        Cluster status of node rabbit@msg02
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg01,rabbit@msg02]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]},{rabbit@msg02,[]}]}]
    
Renew or replace the self-managed RabbitMQ certificates

This section describes how to renew or replace the self-managed RabbitMQ certificates.

To renew or replace the self-managed RabbitMQ certificates:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Create the /openstack/ssl/rabbitmq.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      rabbitmq:
         server:
           enabled: true
           ...
           ssl:
             enabled: True
             key: ${_param:rabbitmq_ssl_key}
             cacert_chain: ${_param:rabbitmq_ssl_cacert_chain}
             cert: ${_param:rabbitmq_ssl_cert}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  3. Create the /openstack/ssl/init.yml file with the following configuration as an example:

    parameters:
      _param:
        rabbitmq_ssl_cacert_chain: |
          -----BEGIN CERTIFICATE-----
          MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          RHXc4FoWv9/n8ZcfsqjQCjF3vUUZBB3zdlfLCLJRruB4xxYukc3gFpFLm21+0ih+
          M8IfJ5I=
          -----END CERTIFICATE-----
        rabbitmq_ssl_key: |
          -----BEGIN RSA PRIVATE KEY-----
          MIIJKQIBAAKCAgEArVSJ16ePjCik+6bZBzhiu3enXw8R9Ms1k4x57633IX1sEZTJ
          ...
          0VgM2bDSNyUuiwCbOMK0Kyn+wGeHF/jGSbVsxYI4OeLFz8gdVUqm7olJj4j3xemY
          BlWVHRa/dEG1qfSoqFU9+IQTd+U42mtvvH3oJHEXK7WXzborIXTQ/08Ztdvy
          -----END RSA PRIVATE KEY-----
        rabbitmq_ssl_cert: |
          -----BEGIN CERTIFICATE-----
          MIIGIDCCBAigAwIBAgIJAJznLlNteaZFMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          MfXPTUI+7+5WQLx10yavJ2gOhdyVuDVagfUM4epcriJbACuphDxHj45GINOGhaCd
          UVVCxqnB9qU16ea/kB3Yzsrus7egr9OienpDCFV2Q/kgUSc7
          -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.
    • If you replace the certificates, modify all three sections.
  4. Update the /openstack/message_queue.yml file by adding the newly created class to the RabbitMQ nodes:

    classes:
    - service.rabbitmq.server.ssl
    - cluster.<cluster_name>.openstack.ssl.rabbitmq
    
  5. Log in to the Salt Master node.

  6. Refresh pillars:

    salt -C 'I@rabbitmq:server' saltutil.refresh_pillar
    
  7. Publish the new certificates by applying the rabbitmq state:

    salt -C 'I@rabbitmq:server' state.sls rabbitmq -l debug
    
  8. Verify the new certificates validity dates:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Apr 27 12:37:14 2018 GMT
    Not After : Apr 27 12:37:14 2019 GMT
    Not Before: Apr 27 12:37:08 2018 GMT
    Not After : Apr 27 12:37:08 2019 GMT
    Not Before: Apr 27 12:37:13 2018 GMT
    Not After : Apr 27 12:37:13 2019 GMT
    
  9. Restart the RabbitMQ services one by one:

    salt -C 'I@rabbitmq:server' cmd.run 'service rabbitmq-server stop; \
    service rabbitmq-server start' -b1
    
  10. Verify the RabbitMQ cluster status:

    salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
    

    Example of system response:

    msg03.multinode-ha.int:
        Cluster status of node rabbit@msg03
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg01,rabbit@msg02,rabbit@msg03]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg01,[]},{rabbit@msg02,[]},{rabbit@msg03,[]}]}]
    msg01.multinode-ha.int:
        Cluster status of node rabbit@msg01
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg02,rabbit@msg01]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg02,[]},{rabbit@msg01,[]}]}]
    msg02.multinode-ha.int:
        Cluster status of node rabbit@msg02
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg01,rabbit@msg02]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]},{rabbit@msg02,[]}]}]
    
  11. Restart all OpenStack API services and agents.

MySQL/Galera certificates

This section describes how to renew or replace the MySQL/Galera certificates that are either managed by salt-minion or self-managed through pillars.

Verify that the MySQL/Galera cluster uses certificates

This section describes how to determine whether your MySQL/Galera cluster uses certificates and identify their location on the system.

To verify that the MySQL/Galera cluster uses certificates:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@galera:master' mysql.showglobal | grep -EB3 '(have_ssl|ssl_(key|ca|cert))$'
    

    Example of system response:

    Value:
        YES
    Variable_name:
        have_ssl
    
    Value:
        /etc/mysql/ssl/ca.pem
    Variable_name:
        ssl_ca
    
    Value:
        /etc/mysql/ssl/cert.pem
    Variable_name:
        ssl_cert
    
    Value:
        /etc/mysql/ssl/key.pem
    Variable_name:
        ssl_key
    
  3. Proceed to renewal or replacement of your certificates as required.
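
Optionally, you can also inspect the expiry date of the certificate file reported above before deciding whether to renew it. The /etc/mysql/ssl/cert.pem path is taken from the example output:

    salt -C 'I@galera:master' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -noout -enddate'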

Renew or replace the MySQL/Galera certificates managed by salt-minion

This section describes how to renew or replace the MySQL/Galera certificates managed by salt-minion.

Prerequisites:

  1. Log in to the Salt Master node.

  2. Verify that the MySQL/Galera cluster is up and synced:

    salt -C 'I@galera:master' mysql.status | grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
    

    Example of system response:

    wsrep_cluster_size:
        3
    
    wsrep_incoming_addresses:
        192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
    
    wsrep_local_state_comment:
        Synced
    
  3. Verify that the log files have no errors:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'cat /var/log/mysql/error.log |grep ERROR|wc -l'
    

    Example of system response:

    dbs01.multinode-ha.int
        0
    
    dbs02.multinode-ha.int
        0
    
    dbs03.multinode-ha.int
        0
    

    Any value other than 0 in the output indicates that the log files contain errors. Review them before proceeding with any MySQL/Galera operations.

  4. Verify that the ca-salt_master_ca certificate is available on all nodes with MySQL/Galera:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'ls /usr/local/share/ca-certificates/ca-salt_master_ca.crt'
    

    Example of system response:

    dbs01.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    
    dbs02.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    
    dbs03.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    

To renew or replace the MySQL/Galera certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Obtain the list of the Galera cluster minions:

    salt -C 'I@galera:master or I@galera:slave' pillar.get _nonexistent | cut -d':' -f1
    

    Example of system response:

    dbs02.multinode-ha.int
    dbs03.multinode-ha.int
    dbs01.multinode-ha.int
    
  3. Verify the certificates validity dates:

    salt -C 'I@galera:master' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    salt -C 'I@galera:slave' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    Not Before: May 30 17:25:24 2018 GMT
    Not After : May 30 17:25:24 2019 GMT
    Not Before: May 30 17:26:52 2018 GMT
    Not After : May 30 17:26:52 2019 GMT
    
  4. Prepare the Galera nodes to work with both the old and the new Salt Master CA certificates:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  5. Verify that the necessary files are present in the ssl directory:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs01.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs02.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs03.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  6. Identify the minion IDs of the Galera nodes:

    • For the Galera master node:

      salt -C 'I@galera:master' test.ping --output yaml | cut -d':' -f1
      

      Example of system response:

      dbs01.multinode-ha.int
      
    • For the Galera slave nodes:

      salt -C 'I@galera:slave' test.ping --output yaml | cut -d':' -f1
      

      Example of system response:

      dbs02.multinode-ha.int
      dbs03.multinode-ha.int
      
  7. Restart the MySQL service for every Galera minion ID one by one. After each Galera minion restart, verify the Galera cluster size and status. Proceed to the next Galera minion restart only if the Galera cluster is synced.

    • To restart the MySQL service for a Galera minion:

      salt <minion_ID> service.stop mysql
      salt <minion_ID> service.start mysql
      
    • To verify the Galera cluster size and status:

      salt -C 'I@galera:master' mysql.status | grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
      

      Example of system response:

      wsrep_cluster_size:
          3
      
      wsrep_incoming_addresses:
          192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
      
      wsrep_local_state_comment:
          Synced
      
  8. If you replace the certificates, remove the private key:

    salt -C 'I@galera:master' cmd.run 'mv /etc/mysql/ssl/key.pem /root'
    
  9. Force the certificates regeneration for the Galera master node:

    salt -C 'I@galera:master' cmd.run 'mv /etc/mysql/ssl/cert.pem /root; mv /etc/mysql/ssl/ca.pem /root'
    
    salt -C 'I@galera:master' state.sls salt.minion.cert -l debug
    
    salt -C 'I@galera:master' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  10. Verify that the certificates validity dates have changed:

    salt -C 'I@galera:master' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    
  11. Verify that the necessary files are present in the ssl directory on the Galera master node:

    salt -C 'I@galera:master' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs01.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  12. Restart the MySQL service on the Galera master node:

    salt -C 'I@galera:master' service.stop mysql
    salt -C 'I@galera:master' service.start mysql
    
  13. Verify that the Galera cluster status is up. For details, see step 7.

  14. If you replace the certificates, remove the private key:

    salt -C 'I@galera:slave' cmd.run 'mv /etc/mysql/ssl/key.pem /root'
    
  15. Force the certificates regeneration for the Galera slave nodes:

    salt -C 'I@galera:slave' cmd.run 'mv /etc/mysql/ssl/cert.pem /root; mv /etc/mysql/ssl/ca.pem /root'
    
    salt -C 'I@galera:slave' state.sls salt.minion.cert -l debug
    
    salt -C 'I@galera:slave' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  16. Verify that the necessary files are present in the ssl directory on the Galera slave nodes:

    salt -C 'I@galera:slave' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs02.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs03.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  17. Verify that the certificates validity dates have changed:

    salt -C 'I@galera:slave' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    Not Before: Jun  4 16:14:31 2018 GMT
    Not After : Jun  4 16:14:31 2019 GMT
    
  18. Restart the MySQL service for every Galera slave minion ID one by one. After each Galera slave minion restart, verify the Galera cluster size and status. Proceed to the next Galera slave minion restart only if the Galera cluster is synced. For details, see step 7. A minimal automation sketch follows this procedure.
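
The following is a minimal automation sketch for the rolling restart described in steps 7 and 18. It assumes that it is run from the Salt Master node and that the minion IDs match the examples above; substitute the IDs from your deployment:

# Restart MySQL on one Galera minion at a time and wait until the cluster
# reports Synced before moving on. The minion IDs are examples from this guide.
for minion in dbs01.multinode-ha.int dbs02.multinode-ha.int dbs03.multinode-ha.int; do
  salt "${minion}" service.stop mysql
  salt "${minion}" service.start mysql
  until salt -C 'I@galera:master' mysql.status | \
        grep -A1 'wsrep_local_state_comment' | grep -q 'Synced'; do
    sleep 10
  done
done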

Renew or replace the self-managed MySQL/Galera certificates

This section describes how to renew or replace the self-managed MySQL/Galera certificates.

To renew or replace the self-managed MySQL/Galera certificates:

  1. Log in to the Salt Master node.

  2. Create the classes/cluster/<cluster_name>/openstack/ssl/galera_master.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      galera:
        master:
          ssl:
            enabled: True
            cacert_chain: ${_param:galera_ssl_cacert_chain}
            key: ${_param:galera_ssl_key}
            cert: ${_param:galera_ssl_cert}
            ca_file: ${_param:mysql_ssl_ca_file}
            key_file: ${_param:mysql_ssl_key_file}
            cert_file: ${_param:mysql_ssl_cert_file}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  3. Create the classes/cluster/<cluster_name>/openstack/ssl/galera_slave.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      galera:
        slave:
          ssl:
            enabled: True
            cacert_chain: ${_param:galera_ssl_cacert_chain}
            key: ${_param:galera_ssl_key}
            cert: ${_param:galera_ssl_cert}
            ca_file: ${_param:mysql_ssl_ca_file}
            key_file: ${_param:mysql_ssl_key_file}
            cert_file: ${_param:mysql_ssl_cert_file}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  4. Create the classes/cluster/<cluster_name>/openstack/ssl/init.yml file with the following configuration as an example:

    parameters:
      _param:
        mysql_ssl_key_file: /etc/mysql/ssl/key.pem
        mysql_ssl_cert_file: /etc/mysql/ssl/cert.pem
        mysql_ssl_ca_file: /etc/mysql/ssl/ca.pem
        galera_ssl_cacert_chain: |
          -----BEGIN CERTIFICATE-----
          MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          RHXc4FoWv9/n8ZcfsqjQCjF3vUUZBB3zdlfLCLJRruB4xxYukc3gFpFLm21+0ih+
          M8IfJ5I=
          -----END CERTIFICATE-----
        galera_ssl_key: |
          -----BEGIN RSA PRIVATE KEY-----
          MIIJKQIBAAKCAgEArVSJ16ePjCik+6bZBzhiu3enXw8R9Ms1k4x57633IX1sEZTJ
          ...
          0VgM2bDSNyUuiwCbOMK0Kyn+wGeHF/jGSbVsxYI4OeLFz8gdVUqm7olJj4j3xemY
          BlWVHRa/dEG1qfSoqFU9+IQTd+U42mtvvH3oJHEXK7WXzborIXTQ/08Ztdvy
          -----END RSA PRIVATE KEY-----
        galera_ssl_cert: |
          -----BEGIN CERTIFICATE-----
          MIIGIDCCBAigAwIBAgIJAJznLlNteaZFMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          MfXPTUI+7+5WQLx10yavJ2gOhdyVuDVagfUM4epcriJbACuphDxHj45GINOGhaCd
          UVVCxqnB9qU16ea/kB3Yzsrus7egr9OienpDCFV2Q/kgUSc7
          -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.
    • If you replace the certificates, modify all three sections.
  5. Update the classes/cluster/<cluster_name>/infra/config.yml file by adding the newly created classes to the database nodes:

    openstack_database_node01:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node01_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.database_init
      - cluster.${_param:cluster_name}.openstack.ssl.galera_master
    openstack_database_node02:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node02_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.ssl.galera_slave
    openstack_database_node03:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node03_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.ssl.galera_slave
    
  6. Regenerate the Reclass storage:

    salt-call state.sls reclass.storage -l debug
    
  7. Refresh pillars:

    salt -C 'I@galera:master or I@galera:slave' saltutil.refresh_pillar
    
  8. Verify the certificates validity dates:

    salt -C 'I@galera:master' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    salt -C 'I@galera:slave' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    Not Before: May 30 17:25:24 2018 GMT
    Not After : May 30 17:25:24 2019 GMT
    Not Before: May 30 17:26:52 2018 GMT
    Not After : May 30 17:26:52 2019 GMT
    
  9. Force the certificate regeneration on the Galera master node:

    salt -C 'I@galera:master' state.sls galera -l debug
    
  10. Verify the new certificates validity dates on the Galera master node:

    salt -C 'I@galera:master' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    
  11. Restart the MySQL service on the Galera master node:

    salt -C 'I@galera:master' service.stop mysql
    salt -C 'I@galera:master' service.start mysql
    
  12. Verify that the Galera cluster status is up:

    salt -C 'I@galera:master' mysql.status | \
    grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
    

    Example of system response:

    wsrep_cluster_size:
        3
    
    wsrep_incoming_addresses:
        192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
    
    wsrep_local_state_comment:
        Synced
    
  13. Force the certificate regeneration on the Galera slave nodes:

    salt -C 'I@galera:slave' state.sls galera -l debug
    
  14. Verify that the certificates validity dates have changed:

    salt -C 'I@galera:slave' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    Not Before: Jun  4 16:14:31 2018 GMT
    Not After : Jun  4 16:14:31 2019 GMT
    
  15. Obtain the minion IDs of the Galera slave nodes:

    salt -C 'I@galera:slave' test.ping --output yaml | cut -d':' -f1
    

    Example of system response:

    dbs02.multinode-ha.int
    dbs03.multinode-ha.int
    
  16. Restart the MySQL service for every Galera slave minion ID one by one. After each Galera slave minion restart, verify the Galera cluster size and status. Proceed to the next Galera slave minion restart only if the Galera cluster is synced.

    • To restart the MySQL service for a Galera slave minion:

      salt <minion_ID> service.stop mysql
      salt <minion_ID> service.start mysql
      
    • To verify the Galera cluster size and status:

      salt -C 'I@galera:master' mysql.status | \
      grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
      

      Example of system response:

      wsrep_cluster_size:
          3
      
      wsrep_incoming_addresses:
          192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
      
      wsrep_local_state_comment:
          Synced
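
As an additional sanity check before or after the restarts, you can verify that the certificate and private key supplied through the pillar actually form a matching pair. This is a generic OpenSSL technique, not specific to MCP; the two digests must be identical on every node:

salt -C 'I@galera:master or I@galera:slave' cmd.run 'openssl x509 -noout -modulus -in /etc/mysql/ssl/cert.pem | openssl md5'
salt -C 'I@galera:master or I@galera:slave' cmd.run 'openssl rsa -noout -modulus -in /etc/mysql/ssl/key.pem | openssl md5'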
      

Barbican certificates

This section describes how to renew certificates in the Barbican service with a configured Dogtag plugin.

Renew Barbican administrator certificates

This section describes how to renew administrator certificates in the Barbican service with a configured Dogtag plugin.

Prerequisites:

  1. Log in to the OpenStack secrets storage node (kmn).

  2. Obtain the list of certificates:

    certutil -L -d /root/.dogtag/pki-tomcat/ca/alias/
    

    Example of system response:

    Certificate Nickname   Trust Attributes
                           SSL,S/MIME,JAR/XPI
    caadmin                u,u,u
    
  3. Note the nickname and attributes of the administrator certificate to renew:

    caadmin   u,u,u
    
  4. Review the certificate validity date and note its serial number:

    certutil -L -d /root/.dogtag/pki-tomcat/ca/alias/ -n "caadmin" | egrep "Serial|Before|After"
    

    Example of system response:

    Serial Number: 6 (0x6)
        Not Before: Tue Apr 26 12:42:31 2022
        Not After : Mon Apr 15 12:42:31 2024
    

To renew the Barbican administrator certificate:

  1. Log in to the OpenStack secrets storage node (kmn).

  2. Obtain the profile template:

    pki ca-cert-request-profile-show caManualRenewal --output caManualRenewal.xml
    
  3. Edit the profile template and add the serial number of the certificate to renew to the highlighted lines of the below template:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <CertEnrollmentRequest>
        <Attributes/>
        <ProfileID>caManualRenewal</ProfileID>
        <Renewal>true</Renewal>
        <SerialNumber>6</SerialNumber>           <!--Insert SerialNumber here-->
        <RemoteHost></RemoteHost>
        <RemoteAddress></RemoteAddress>
        <Input id="i1">
            <ClassID>serialNumRenewInputImpl</ClassID>
            <Name>Serial Number of Certificate to Renew</Name>
            <Attribute name="serial_num">
                <Value>6</Value>           <!--Insert SerialNumber here-->
                <Descriptor>
                    <Syntax>string</Syntax>
                    <Description>Serial Number of Certificate to Renew</Description>
                </Descriptor>
            </Attribute>
        </Input>
    </CertEnrollmentRequest>
    
  4. Submit the request and note the request ID:

    pki ca-cert-request-submit caManualRenewal.xml
    

    Example of system response:

    -----------------------------
    Submitted certificate request
    -----------------------------
      Request ID: 9
      Type: renewal
      Request Status: pending
      Operation Result: success
    
  5. Using the password from /root/.dogtag/pki-tomcat/ca/password.conf, approve the request and note the ID of the new certificate:

    Note

    During the first run on a system with self-signed certificates, you may get a warning about an untrusted issuer. In this case, proceed with importing the CA certificate and accept the default CA server URI.

    pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin ca-cert-request-review 9 --action approve
    

    Example of system response:

    -------------------------------
    Approved certificate request 10
    -------------------------------
      Request ID: 9
      Type: renewal
      Request Status: complete
      Operation Result: success
      Certificate ID: 0x10
    
  6. Download the renewed certificate:

    pki ca-cert-show 0x10 --output ca_admin_new.crt
    

    Example of system response:

    ------------------
    Certificate "0x10"
    ------------------
      Serial Number: 0x10
      Issuer: CN=CA Signing Certificate,O=EXAMPLE
      Subject: CN=PKI Administrator,E=caadmin@example.com,O=EXAMPLE
      Status: VALID
      Not Before: Tue Jun 14 12:24:14 UTC 2022
      Not After: Wed Jun 14 12:24:14 UTC 2023
    
  7. Add the renewed certificate to the caadmin and kraadmin users in the LDAP database:

    pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin ca-user-cert-add --serial 0x10 caadmin
    pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin kra-user-cert-add --serial 0x10 kraadmin
    

    Example of system response:

    -----------------------------------------------------------------------------------------------------------------
    Added certificate "2;16;CN=CA Signing Certificate,O=EXAMPLE;CN=PKI Administrator,E=caadmin@example.com,O=EXAMPLE"
    -----------------------------------------------------------------------------------------------------------------
      Cert ID: 2;16;CN=CA Signing Certificate,O=EXAMPLE;CN=PKI Administrator,E=caadmin@example.com,O=EXAMPLE
      Version: 2
      Serial Number: 0x10
      Issuer: CN=CA Signing Certificate,O=EXAMPLE
      Subject: CN=PKI Administrator,E=caadmin@example.com,O=EXAMPLE
    
  8. Verify that the new certificate is present in the system:

    ldapsearch -D "cn=Directory Manager" -b "dc=example,dc=com" -w rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da "uid=caadmin"
    ldapsearch -D "cn=Directory Manager" -b "o=pki-tomcat-KRA" -w rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da "uid=kraadmin"
    

    Example of system response:

    # extended LDIF
    #
    # LDAPv3
    # base <dc=example,dc=com> with scope subtree
    # filter: uid=caadmin
    # requesting: ALL
    #
    
    # caadmin, people, example.com
    dn: uid=caadmin,ou=people,dc=example,dc=com
    objectClass: top
    objectClass: person
    objectClass: organizationalPerson
    objectClass: inetOrgPerson
    objectClass: cmsuser
    uid: caadmin
    sn: caadmin
    cn: caadmin
    mail: caadmin@example.com
    usertype: adminType
    userstate: 1
    userPassword:: e1NTSEF9QWY5Mys3a2ZHRUh0cHVyMnhVbDNPcVB2TGZoZHREd2Y3ejRhYnc9PQ=
     =
    description: 2;6;CN=CA Signing Certificate,O=EXAMPLE;CN=PKI Administrator,E=ca
     admin@example.com,O=EXAMPLE
    description: 2;16;CN=CA Signing Certificate,O=EXAMPLE;CN=PKI Administrator,E=c
     aadmin@example.com,O=EXAMPLE
    userCertificate:: MIIDnTCCAoWgAwIBAgIBBjANBgkqhkiG9w0BAQsFADAzMRAwDgYDVQQKDAdF
    ...
    userCertificate:: MIIDnTCCAoWgAwIBAgIBEDANBgkqhkiG9w0BAQsFADAzMRAwDgYDVQQKDAdF
    ...
    
  9. Stop the pki-tomcatd service:

    systemctl stop pki-tomcatd
    
  10. Delete the old certificate using the nickname noted in the prerequisite steps:

    certutil -D -n "caadmin" -d /root/.dogtag/pki-tomcat/ca/alias/
    
  11. Import the renewed certificate using the attributes noted in the prerequisite steps:

    certutil -A -n "caadmin" -t u,u,u -d /root/.dogtag/pki-tomcat/ca/alias/ -a -i ca_admin_new.crt
    
  12. Start the pki-tomcatd service:

    systemctl start pki-tomcatd
    
  13. Verify the new certificate:

    certutil -L -d /root/.dogtag/pki-tomcat/ca/alias/ -n "caadmin" | egrep "Serial|Before|After"
    

    Example of system response:

    Serial Number: 16 (0x10)
        Not Before: Tue Jun 14 12:24:14 2022
        Not After : Wed Jun 14 12:24:14 2023
    
  14. Create new ca_admin_cert.p12 and kra_admin_cert.pem files:

    openssl pkcs12 -in /root/.dogtag/pki-tomcat/ca_admin_cert.p12 -passin pass:rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -passout pass:1234567 -nocerts -out passPrivateKey.pem
    openssl rsa -in passPrivateKey.pem -out "privateKey.pem" -passin pass:1234567
    openssl pkcs12 -export -in ca_admin_new.crt -inkey privateKey.pem -out ca_admin_new.p12 -clcerts -passout pass:rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da
    openssl pkcs12 -in ca_admin_new.p12 -passin pass:rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -out kra_admin_cert_new.pem -nodes
    

    You can change the passout and passin parameters for a stronger password pair.

  15. Update kra_admin_cert.pem in the barbican and dogtag folders:

    cp /etc/barbican/kra_admin_cert.pem ./kra_admin_cert_old.pem
    cp kra_admin_cert_new.pem /etc/barbican/kra_admin_cert.pem
    cp kra_admin_cert_new.pem /etc/dogtag/kra_admin_cert.pem
    systemctl restart barbican-worker.service
    systemctl restart apache2
    

Warning

Once you update the certificate on the master node, replicate the changes to the other nodes. To do so, transfer kra_admin_cert_new.pem from the master node to /etc/barbican/kra_admin_cert.pem on the other nodes.
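
A minimal sketch of this replication from the Salt Master node, assuming that kra_admin_cert_new.pem has been copied to the working directory on the Salt Master node and that the kmn* target matches your secrets storage nodes (adjust both to your deployment):

# Distribute the renewed certificate and restart the consuming services.
# The node where the file was generated is simply overwritten with the same content.
salt-cp 'kmn*' kra_admin_cert_new.pem /etc/barbican/kra_admin_cert.pem
salt-cp 'kmn*' kra_admin_cert_new.pem /etc/dogtag/kra_admin_cert.pem
salt 'kmn*' service.restart barbican-worker
salt 'kmn*' service.restart apache2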

Renew Barbican system certificates

This section describes how to renew system certificates in the Barbican service with a configured Dogtag plugin.

Prerequisites:

  1. Log in to the OpenStack secrets storage node (kmn).

  2. Back up the pki-tomcat configuration files:

    • /etc/pki/pki-tomcat/ca/CS.cfg
    • /etc/pki/pki-tomcat/kra/CS.cfg
  3. Obtain the list of used certificates:

    certutil -L -d /etc/pki/pki-tomcat/alias
    

    Example of system response:

    Certificate Nickname                   Trust Attributes
                                           SSL,S/MIME,JAR/XPI
    
    ocspSigningCert cert-pki-tomcat CA     u,u,u
    subsystemCert cert-pki-tomcat          u,u,u
    storageCert cert-pki-tomcat KRA        u,u,u
    Server-Cert cert-pki-tomcat            u,u,u
    caSigningCert cert-pki-tomcat CA       CTu,Cu,Cu
    auditSigningCert cert-pki-tomcat CA    u,u,Pu
    transportCert cert-pki-tomcat KRA      u,u,u
    auditSigningCert cert-pki-tomcat KRA   u,u,Pu
    

    Note

    Server-Cert cert-pki-tomcat certificates are unique for each kmn node.

    To obtain the serial numbers of these certificates, run the following command on the Salt Master node:

    salt 'kmn*' cmd.run "certutil -L -d /etc/pki/pki-tomcat/alias -n 'Server-Cert cert-pki-tomcat' | egrep 'Serial|Before|After'"
    

    Example of system response:

    kmn01.dogtag.local:
            Serial Number: 3 (0x3)
                Not Before: Mon Dec 19 16:57:18 2022
                Not After : Sun Dec 08 16:57:18 2024
    kmn02.dogtag.local:
            Serial Number: 11 (0xb)
                Not Before: Mon Dec 19 17:02:39 2022
                Not After : Sun Dec 08 17:02:39 2024
    kmn03.dogtag.local:
            Serial Number: 10 (0xa)
                Not Before: Mon Dec 19 17:00:40 2022
                Not After : Sun Dec 08 17:00:40 2024
    

    Other certificates are the same for all servers:

    ocspSigningCert cert-pki-tomcat CA
    subsystemCert cert-pki-tomcat
    storageCert cert-pki-tomcat KRA
    caSigningCert cert-pki-tomcat CA
    auditSigningCert cert-pki-tomcat CA
    transportCert cert-pki-tomcat KRA
    auditSigningCert cert-pki-tomcat KRA
    
  4. Note the nickname and attributes of the certificate to renew:

    transportCert cert-pki-tomcat KRA   u,u,u
    
  5. Review the certificate validity date and note its serial number:

    certutil -L -d /etc/pki/pki-tomcat/alias -n "transportCert cert-pki-tomcat KRA" | egrep "Serial|Before|After"
    

    Example of system response:

    Serial Number: 7 (0x7)
        Not Before: Tue Apr 26 12:42:31 2022
        Not After : Mon Apr 15 12:42:31 2024
    

To renew the Barbican system certificate:

  1. Log in to the OpenStack secrets storage node (kmn).

  2. Obtain the profile template:

    pki ca-cert-request-profile-show caManualRenewal --output caManualRenewal.xml
    
  3. Edit the profile template and add the serial number of the certificate to renew to the highlighted lines of the below template:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <CertEnrollmentRequest>
        <Attributes/>
        <ProfileID>caManualRenewal</ProfileID>
        <Renewal>true</Renewal>
        <SerialNumber>6</SerialNumber>           <!--Insert SerialNumber here-->
        <RemoteHost></RemoteHost>
        <RemoteAddress></RemoteAddress>
        <Input id="i1">
            <ClassID>serialNumRenewInputImpl</ClassID>
            <Name>Serial Number of Certificate to Renew</Name>
            <Attribute name="serial_num">
                <Value>6</Value>           <!--Insert SerialNumber here-->
                <Descriptor>
                    <Syntax>string</Syntax>
                    <Description>Serial Number of Certificate to Renew</Description>
                </Descriptor>
            </Attribute>
        </Input>
    </CertEnrollmentRequest>
    
  4. Submit the request and note the request ID:

    pki ca-cert-request-submit caManualRenewal.xml
    

    Example of system response:

    -----------------------------
    Submitted certificate request
    -----------------------------
      Request ID: 16
      Type: renewal
      Request Status: pending
      Operation Result: success
    
  5. Using the password from /root/.dogtag/pki-tomcat/ca/password.conf, approve the request and note the new certificate ID:

    pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin ca-cert-request-review 16 --action approve
    

    Example of system response:

    -------------------------------
    Approved certificate request 16
    -------------------------------
      Request ID: 16
      Type: renewal
      Request Status: complete
      Operation Result: success
      Certificate ID: 0xf
    
  6. Download the renewed certificate:

    pki ca-cert-show 0xf --output kra_transport.crt
    

    Example of system response:

    -----------------
    Certificate "0xf"
    -----------------
      Serial Number: 0xf
      Issuer: CN=CA Signing Certificate,O=EXAMPLE
      Subject: CN=DRM Transport Certificate,O=EXAMPLE
      Status: VALID
      Not Before: Fri Jun 10 13:11:50 UTC 2022
      Not After: Thu May 30 13:11:50 UTC 2024
    

    Note

    You can also download an old certificate as a backup measure to revert the changes: pki ca-cert-show 0x7 --output kra_old_transport.crt

  7. Stop the pki-tomcatd service:

    systemctl stop pki-tomcatd
    
  8. Delete the old certificate using the nickname noted in the prerequisite steps:

    certutil -D -d /etc/pki/pki-tomcat/alias -n 'transportCert cert-pki-tomcat KRA'
    
  9. Import the renewed certificate using the attributes noted in the prerequisite steps:

    certutil -A -d /etc/pki/pki-tomcat/alias -n 'transportCert cert-pki-tomcat KRA' -i kra_transport.crt -t "u,u,u"
    
  10. Update the certificate information in /etc/pki/pki-tomcat/kra/CS.cfg with the base64-encoded data of the new certificate, without the header and footer (see the extraction sketch after this procedure):

    kra.transport.cert=MIIDdzCCAl+gAwIBAgIBDzANBgkqhkiG9w0BAQsFADAzMRAwDgYDVQQKDAdFWEFNUExFMR8wHQYDVQQDDBZDQSBTaWduaW5nIENlcnRpZmljYXRlMB4XDTIyMDYxMDEzMTE1MFoXDTI0MDUzMDEzMTE1MFowNjEQMA4GA1UECgwHRVhBTVBMRTEiMCAGA1UEAwwZRFJNIFRyYW5zcG9ydCBDZXJ0aWZpY2F0ZTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAKkh/JkTJpr+4uaz7fNgC7gD9vJCl670nYFqxCdwQWPzmXxg5lpJcM6L15C1cOH9ad8D2h8Dv4H8YknenK0GXrEoRrqAnG9nMHPEV1HTap9exOgZ4gk6aIanzEqqbun54mkAKF0LLmytcY5oyJLVoDpPVnkawLXdWyi7lUFEMnsILGH0kS0o8/9TPlP8HPXegHTAPMZWRHAVyDDJVmw/Qv8NB/oOFuWCZSDOItZQB44i7WgRIj6lPwCQoCZqdjRtq+bW9ITqZnXrFZmfFPfeh1NGiTnoE3aebm2FTCA5R9+9nsoQtAE6cdsmalua8+P0le2oGXbWED8meViYkAwLxYcCAwEAAaOBkjCBjzAfBgNVHSMEGDAWgBRBpHkrpV9RctaTblmz/H96WRecxDBHBggrBgEFBQcBAQQ7MDkwNwYIKwYBBQUHMAGGK2h0dHA6Ly9rbW4wMS5ybC1kb2d0YWctMi5sb2NhbDo4MDgwL2NhL29jc3AwDgYDVR0PAQH/BAQDAgTwMBMGA1UdJQQMMAoGCCsGAQUFBwMCMA0GCSqGSIb3DQEBCwUAA4IBAQCbzfQob6XsJYv3fDEytLH6mn7xBe1Z+U9qUE9V1GWNAeR8Q8cTSdroNS1VARiZwdOfJMnmryZ/WQJuafJTzwtlw1Ge7DaLa2xULGeHmUCfbV+GFSyiF8d90yBHYN6vGWqRVyEDQKegit0OG092bsxIHawPBGh2LabUPuoKmKuz2EzWRLmtOWgE8irHdG8kqsjXHlFpJWq/NNeqK5mohMJeDVzFevkJnj7lizY2F+UCfOs/JG1PZZEdgFoE4Un9VsOoUdWFsNwfRLUiAWzSQrHEm2FzqltTlEutI66PovwZMnqh9AGw67m8YLXDrIQqiAzvzF3YGsxeJvXi0VL8j64B
    

    Note

    When updating all certificates, update both the /etc/pki/pki-tomcat/kra/CS.cfg and /etc/pki/pki-tomcat/ca/CS.cfg files.

    Examples of files containing data about the certificates:

    /etc/pki/pki-tomcat/ca/CS.cfg
    
    ca.subsystem.cert <- subsystemCert cert-pki-tomcat
    ca.ocsp_signing.cert <- ocspSigningCert cert-pki-tomcat CA
    ca.sslserver.cert <- Server-Cert cert-pki-tomcat (unique for each kmn server)
    ca.signing.cert <- caSigningCert cert-pki-tomcat CA
    ca.audit_signing.cert <- auditSigningCert cert-pki-tomcat CA
    
    
    /etc/pki/pki-tomcat/kra/CS.cfg
    
    kra.subsystem.cert <- subsystemCert cert-pki-tomcat
    kra.storage.cert <- storageCert cert-pki-tomcat KRA
    kra.sslserver.cert <- Server-Cert cert-pki-tomcat (unique for each kmn server)
    kra.transport.cert <- transportCert cert-pki-tomcat KRA
    kra.audit_signing.cert <- auditSigningCert cert-pki-tomcat KRA
    
  11. If you are updating the transportCert cert-pki-tomcat KRA certificate, also update the Barbican Network Security Services database (NSSDB):

    certutil -L -d /etc/barbican/alias
    certutil -L -d /etc/barbican/alias -n "KRA transport cert" | egrep "Serial|Before|After"
    certutil -D -d /etc/barbican/alias -n 'KRA transport cert'
    certutil -A -d /etc/barbican/alias -n 'KRA transport cert' -i kra_transport.crt -t ,,
    
  12. Start the pki-tomcatd service:

    systemctl start pki-tomcatd
    
  13. Verify that the new certificate is used:

    certutil -L -d /etc/pki/pki-tomcat/alias -n "transportCert cert-pki-tomcat KRA" | egrep "Serial|Before|After"
    certutil -L -d /etc/barbican/alias -n "KRA transport cert" | egrep "Serial|Before|After"
    
  14. Replicate newly generated certificates to other nodes. If you have updated the Server-Cert cert-pki-tomcat certificates, verify that each kmn node has a unique updated certificate.

  15. Upload the renewed certificates to the remaining kmn nodes by repeating steps 7-13.
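
For step 10, the single-line base64 value can be produced from the downloaded certificate by stripping the PEM header and footer and joining the lines. A minimal sketch using standard tools, assuming the downloaded kra_transport.crt is in PEM format:

# Print the certificate body as a single base64 line suitable for CS.cfg
grep -v 'CERTIFICATE' kra_transport.crt | tr -d '\n'; echo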

Add a new LDAP user certificate

You may need to add a new certificate for an LDAP user, for example, if the certificate is outdated and you cannot access the subsystem. In this case, you may see the following error message when trying to reach the Key Recovery Authority (KRA) subsystem:

pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin kra
PKIException: Unauthorized

To add a new LDAP user certificate:

  1. Log in to the OpenStack secrets storage node (kmn).

  2. Obtain information about the new certificate:

    pki cert-show 16 --encode
    
  3. Note the certificate data and serial number in decimal (serial number: 0x10 = 16).

  4. Verify that the LDAP user does not have a new certificate:

    ldapsearch -D "cn=Directory Manager" -b "o=pki-tomcat-KRA" -w rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -h kmn01.rl-dogtag-2.local "uid=kraadmin"
    
  5. Create a modify_kra_admin.ldif file with information about the new certificate (see the example sketch after this procedure). In the userCertificate attribute, insert the certificate data; in the description attribute, insert the certificate ID with the correct serial number in decimal (serial number: 0x10 = 16).

  6. Apply the changes:

    ldapmodify -D "cn=Directory Manager" -w rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -h kmn01.rl-dogtag-2.local -f ./modify_kra_admin.ldif
    
  7. Verify that the LDAP user has a new certificate:

    ldapsearch -D "cn=Directory Manager" -b "o=pki-tomcat-KRA" -w rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -h kmn01.rl-dogtag-2.local "uid=kraadmin"
    
  8. Verify that the subsystem is accessible:

    pki -d /root/.dogtag/pki-tomcat/ca/alias/ -c rCWuvkszR4tbiDmMHfpLqJDtVQbHP1da -n caadmin kra
    

    Example of system response:

    Commands:
     kra-group               Group management commands
     kra-key                 Key management commands
     kra-selftest            Selftest management commands
     kra-user                User management commands
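
For step 5, the modify_kra_admin.ldif file may look similar to the following sketch. The DN, attribute layout, and serial number are illustrative assumptions based on the example outputs in this guide; substitute the base64 certificate data, the decimal serial number, and the issuer and subject DNs from your environment:

# Hypothetical example for the kraadmin entry (serial 0x10 = 16)
cat > modify_kra_admin.ldif <<'EOF'
dn: uid=kraadmin,ou=people,o=pki-tomcat-KRA
changetype: modify
add: userCertificate
userCertificate:: <base64_certificate_data>
-
add: description
description: 2;16;<issuer_DN>;<subject_DN>
EOF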
    

Change the certificate validity period

You can change a certificate validity period by managing the validity period of the signing policy, which is used for certificate generation and is set to 365 days by default.

Note

The procedure does not update the CA certificates and does not change the signing policy itself.

To change the certificate validity period:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/infra/config/init.yml, specify the following pillar:

    parameters:
      _param:
        salt_minion_ca_days_valid_certificate: <required_value>
        qemu_vnc_ca_days_valid_certificate: <required_value>
    
  3. Apply the changes:

    salt '*' saltutil.sync_all
    salt -C 'I@salt:master' state.sls salt.minion.ca
    salt -C 'I@salt:master' state.sls salt.minion
    
  4. Remove the certificate you need to update.

  5. Apply the following state:

    salt -C '<target_node>' state.sls salt.minion.cert
    
  6. Verify the end date of the updated certificate:

    salt -C '<target_node>' cmd.run 'openssl x509 -enddate -noout -in <path_to_cert>'
    

Enable FQDN on internal endpoints in the Keystone catalog

In new MCP 2019.2.3 deployments, the OpenStack environments use FQDNs on the internal endpoints in the Keystone catalog by default.

In existing MCP deployments, IP addresses are used on the internal Keystone endpoints. This section instructs you on how to enable FQDN on the internal endpoints for existing MCP deployments updated to the MCP 2019.2.3 or newer version.

To enable FQDN on the Keystone internal endpoints:

  1. Verify that you have updated MCP DriveTrain to the 2019.2.3 or newer version as described in Update DriveTrain.

  2. Log in to the Salt Master node.

  3. On the system Reclass level:

    1. Verify that there are classes present under the /srv/salt/reclass/classes/system/linux/network/hosts/openstack directory.

    2. Verify that the following parameters are set in defaults/openstack/init.yml as follows:

      parameters:
        _param:
          openstack_service_hostname: os-ctl-vip
          openstack_service_host: ${_param:openstack_service_hostname}.${linux:system:domain}
      
    3. If you have the extra OpenStack services installed, define the additional parameters in defaults/openstack/init.yml as required:

      • For Manila:

        parameters:
          _param:
            openstack_share_service_hostname: os-share-vip
            openstack_share_service_host: ${_param:openstack_share_service_hostname}.${linux:system:domain}
        
      • For Barbican:

        parameters:
          _param:
            openstack_kmn_service_hostname: os-kmn-vip
            openstack_kmn_service_host: ${_param:openstack_kmn_service_hostname}.${linux:system:domain}
        
      • For Tenant Telemetry:

        parameters:
          _param:
            openstack_telemetry_service_hostname: os-telemetry-vip
            openstack_telemetry_service_host: ${_param:openstack_telemetry_service_hostname}.${linux:system:domain}
        
  4. On the cluster Reclass level, configure the FQDN on internal endpoints by editing infra/init.yml:

    1. Add the following class for the core OpenStack services:

      classes:
        - system.linux.network.hosts.openstack
      
    2. If you have the extra OpenStack services installed, define the additional classes as required:

      • For Manila:

        classes:
          - system.linux.network.hosts.openstack.share
        
      • For Barbican:

        classes:
          - system.linux.network.hosts.openstack.kmn
        
      • For Tenant Telemetry:

        classes:
          - system.linux.network.hosts.openstack.telemetry
        
  5. On the cluster Reclass level, define the following parameters in the openstack/init.yml file:

    1. Define the following parameters for the core OpenStack services:

      parameters:
        _param:
          glance_service_host: ${_param:openstack_service_host}
          keystone_service_host: ${_param:openstack_service_host}
          heat_service_host: ${_param:openstack_service_host}
          cinder_service_host: ${_param:openstack_service_host}
          nova_service_host: ${_param:openstack_service_host}
          placement_service_host: ${_param:openstack_service_host}
          neutron_service_host: ${_param:openstack_service_host}
      
    2. If you have the extra services installed, define the following parameters as required:

      • For Tenant Telemetry:

        parameters:
          _param:
            aodh_service_host: ${_param:openstack_telemetry_service_host}
            ceilometer_service_host: ${_param:openstack_telemetry_service_host}
            panko_service_host: ${_param:openstack_telemetry_service_host}
            gnocchi_service_host: ${_param:openstack_telemetry_service_host}
        
      • For Manila:

        parameters:
          _param:
            manila_service_host: ${_param:openstack_share_service_host}
        
      • For Designate:

        parameters:
          _param:
            designate_service_host: ${_param:openstack_service_host}
        
      • For Barbican:

        parameters:
          _param:
            barbican_service_host: ${_param:openstack_kmn_service_host}
        
  6. Apply the keystone state:

    salt -C 'I@keystone:server' state.apply keystone
    
  7. Log in to one of the OpenStack controller nodes.

  8. Verify that the changes have been applied successfully:

    openstack endpoint list
    
  9. If SSL is used on the Keystone internal endpoints:

    1. If Manila or Telemetry is installed:

      1. Log in to the Salt Master node.

      2. Open the Reclass cluster level of your deployment.

      3. For Manila, edit /openstack/share.yml. For example:

        parameters:
          _param:
            openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_local_address},IP:${_param:cluster_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_vip_address},DNS:${_param:openstack_share_service_host}
        
      4. For Tenant Telemetry, edit /openstack/telemetry.yml. For example:

        parameters:
          _param:
            openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_local_address},IP:${_param:cluster_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_vip_address},DNS:${_param:openstack_telemetry_service_host}
        
    2. Renew the OpenStack API certificates to include FQDN in CommonName (CN) as described in Manage certificates.
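
After completing the procedure, you can spot-check that the internal endpoints now contain FQDNs rather than IP addresses. A minimal sketch using the OpenStack client on a controller node (column selection depends on the client version):

openstack endpoint list --interface internal -c 'Service Name' -c 'Interface' -c 'URL'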

Enable Keystone security compliance policies

In the MCP OpenStack deployments, you can enable additional Keystone security compliance features independently of each other based on your corporate security policy. All available features apply only to the SQL back end for the Identity driver. By default, all security compliance features are disabled.

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to enable the Keystone security compliance features on an existing MCP OpenStack deployment. For the new deployments, you can configure the compliance features during the Reclass deployment model creation through Model Designer.

Keystone security compliance parameters
For each operation, the first parameter enables the policy in Keystone for all SQL back-end users, and the second one, where available, overrides the setting for specific users.

  • Force the user to change the password upon the first use

    Enable in Keystone for all SQL back-end users:
      change_password_upon_first_use: True
      Forces the user to change their password upon the first use.

    Override settings for specific users:
      ignore_change_password_upon_first_use: True

  • Configure password expiration

    Enable in Keystone for all SQL back-end users:
      password_expires_days: <NUM>
      Sets the number of days after which the password expires.

    Override settings for specific users:
      ignore_password_expiry: True

  • Set an account lockout threshold

    Enable in Keystone for all SQL back-end users:
      lockout_failure_attempts: <NUM>
      Sets the maximum number of failed authentication attempts.
      lockout_duration: <NUM>
      Sets the duration (in seconds) for which the account remains locked after the threshold is exceeded.

    Override settings for specific users:
      ignore_lockout_failure_attempts: True

  • Restrict the user from changing their password

    Enable in Keystone for all SQL back-end users: N/A

    Override settings for specific users:
      lock_password: True

  • Configure password strength requirements

    Enable in Keystone for all SQL back-end users:
      password_regex: <STRING> [1]
      Sets the strength requirements for the passwords.
      password_regex_description: <STRING>
      Provides the text that describes the password strength requirements. Required if password_regex is set.

    Override settings for specific users: N/A

  • Disable inactive users

    Enable in Keystone for all SQL back-end users:
      disable_user_account_days_inactive: <NUM> [2]
      Sets the number of days of inactivity after which the user is disabled.

    Override settings for specific users: N/A

  • Configure a unique password history

    Enable in Keystone for all SQL back-end users:
      unique_last_password_count: <NUM>
      Sets the number of passwords for a user that must be unique before an old password can be reused.
      minimum_password_age: <NUM>
      Sets the number of days a password must be in use before the user can change it.

    Override settings for specific users: N/A

Warning

[1]When enabled, it may affect all operations with Heat. Heat creates its service users with automatically generated passwords that are 32 characters long and contain uppercase and lowercase letters, digits, and special characters such as !, @, #, %, ^, &, and *. Therefore, to avoid affecting Heat operations, verify that your custom value for this option allows such generated passwords. Currently, you cannot override the password regex enforcement in Keystone for a specific user.
[2]When enabled, it may affect autoscaling and other Heat operations that require deferred authentication. If the Heat service user created during the deployment has been inactive for longer than the defined period by the time such an operation is first performed, the user is disabled and cannot authenticate. Currently, you cannot override this parameter in Keystone for a specific user.

To enable the security compliance policies:

  1. Log in to the Salt Master node.

  2. Open your Git project repository with the Reclass model on the cluster level.

  3. Open the openstack/control/init.yml file for editing.

  4. Configure the security compliance policies for the OpenStack service users as required.

    • For all OpenStack service users. For example:

      parameters:
        _param:
          openstack_service_user_options:
            ignore_change_password_upon_first_use: True
            ignore_password_expiry: True
            ignore_lockout_failure_attempts: False
            lock_password: False
      
    • For specific service users on OpenStack Queens and newer releases. For example:

      keystone:
          client:
            resources:
              v3:
                users:
                  cinder:
                    options:
                      ignore_change_password_upon_first_use: True
                      ignore_password_expiry: False
                      ignore_lockout_failure_attempts: False
                      lock_password: True
      
    • For specific service users on OpenStack Pike and older releases. For example:

      keystone:
         client:
           server:
             identity:
               project:
                 service:
                   user:
                     cinder:
                       options:
                         ignore_change_password_upon_first_use: True
                         ignore_password_expiry: False
                         ignore_lockout_failure_attempts: False
                         lock_password: True
      
  5. Enable the security compliance features on the Keystone server side by defining the related Keystone server parameters as required.

    Example configuration:

    keystone:
      server:
        security_compliance:
          disable_user_account_days_inactive: 90
          lockout_failure_attempts: 5
          lockout_duration: 600
          password_expires_days: 90
          unique_last_password_count: 10
          minimum_password_age: 0
          password_regex: '^(?=.*\d)(?=.*[a-zA-Z]).{7,}$$'
          password_regex_description: 'Your password must contain at least 1 letter, 1 digit, and have a minimum length of 7 characters'
          change_password_upon_first_use: true
    
  6. Apply the changes:

    salt -C 'I@keystone:client' state.sls keystone.client
    salt -C 'I@keystone:server' state.sls keystone.server
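
Once both states have been applied, you can confirm that the options reached the Keystone configuration. A minimal verification sketch from the Salt Master node, assuming the standard /etc/keystone/keystone.conf path:

salt -C 'I@keystone:server' cmd.run 'grep -A 10 "^\[security_compliance\]" /etc/keystone/keystone.conf'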
    

Restrict the VM image policy

This section instructs you on how to restrict the Glance, Nova, and Cinder snapshot policies so that only administrators can manage images and snapshots in your OpenStack environment.

To configure Administrator only policy:

  1. In the /etc/nova directory, create and edit the policy.json for Nova as follows:

    {
        "os_compute_api:servers:create_image": "rule:admin_api",
        "os_compute_api:servers:create_image:allow_volume_backed": "rule:admin_api",
    }
    
  2. In the openstack/control.yml file, restrict managing operations by setting the role:admin value for the following parameters for Glance and Cinder:

    parameters:
      glance:
        server:
          policy:
            add_image: "role:admin"
            delete_image: "role:admin"
            modify_image: "role:admin"
            publicize_image: "role:admin"
            copy_from: "role:admin"
            upload_image: "role:admin"
            delete_image_location: "role:admin"
            set_image_location: "role:admin"
            deactivate: "role:admin"
            reactivate: "role:admin"
      cinder:
        server:
          policy:
            'volume_extension:volume_actions:upload_image': "role:admin"
    
  3. Apply the following states:

    salt 'ctl*' state.sls glance.server,cinder.controller
    
  4. Verify that the rules have changed in the states output.

  5. If the Comment: State 'keystone_policy.rule_present' was not found in SLS 'glance.server' error occurs, synchronize Salt modules and re-apply the glance.server state:

    salt 'ctl*' saltutil.sync_all
    salt 'ctl*' state.sls glance.server
    
  6. To apply the changes, restart the glance-api service:

    salt 'ctl*' service.restart glance-api
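
To confirm the restriction, you can attempt an image operation as a regular (non-admin) user; it should now be rejected with an HTTP 403 error. A minimal sketch, assuming a hypothetical non-admin credentials file named keystonerc-user:

source keystonerc-user                        # hypothetical non-admin credentials
openstack image create restricted-test-image  # expected to fail with 403 Forbidden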
    

Configure Neutron OVS

After deploying an OpenStack environment with Neutron Open vSwitch, you may want to enable some of the additional features and configurations that Neutron provides.

This section describes how to enable and operate supported Neutron features.

Configure Neutron Quality of Service (QoS)

Neutron Quality of Service, or QoS, is a Neutron feature that enables OpenStack administrators to limit and prioritize network traffic through a set of policies for better network bandwidth.

MCP supports QoS policies with the following limitations:

  • The bandwidth limit for SR-IOV must be specified in Megabits per second (Mbps), that is, it must be divisible by 1000 Kilobits per second (Kbps).

    All values lower than 1000 Kbps are rounded up to 1 Mbps. Since floating-point values are not supported, all values that cannot be divided into whole 1000 Kbps chunks are rounded up to the nearest integer Mbps value.

  • QoS rules are supported for the egress traffic only.

  • The network interface driver must support minimum transmit bandwidth (min_tx_rate).

    Minimum transmit bandwidth is supported by such drivers as QLogic 10 Gigabit Ethernet Driver (qlcnic), BNXT Poll Mode driver (bnxt), and so on. The Intel Linux ixgbe and i40e drivers do not support setting minimum transmit bandwidth.

  • No automatic oversubscription protection.

    Since the minimum transmit bandwidth is supported on the hypervisor level, your network is not protected from oversubscription. Total bandwidth on all ports may exceed maximum available bandwidth in the provider’s network.

This section describes how to configure Neutron Quality of Service.

Enable Neutron Quality of Service

By default, Neutron QoS is disabled. You can enable Neutron QoS before or after deploying an OpenStack environment.

To enable Neutron Quality of Service:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster-name>.openstack.init.yml file for editing.

  3. Set the neutron_enable_qos parameter to True:

    parameters:
      _param:
        neutron_enable_qos: True
        ...
    

    This turns on the QoS functionality. Depending on the deployment, the subsequent Salt states enable the QoS extension for the openvswitch and/or sriovnicswitch agents.

  4. Re-run Salt configuration on the Salt Master node:

    salt -C 'I@neutron:server' state.sls neutron
    salt -C 'I@neutron:gateway' state.sls neutron
    salt -C 'I@neutron:compute' state.sls neutron
    
  5. Proceed to Create a QoS policy.
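
Optionally, verify that the qos extension is now loaded by querying the Neutron API from an OpenStack controller node. A minimal sketch:

openstack extension list --network | grep -i qos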

Create a QoS policy

After you enable the Neutron Quality of Service feature, configure a QoS policy to prioritize one type of traffic over another. This section describes basic operations. For more information, see the OpenStack documentation.

To create a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Create a QoS policy:

    neutron qos-policy-create bw-limiter
    
  3. Add a rule to the QoS policy:

    neutron qos-bandwidth-limit-rule-create bw-limiter --max-kbps 3000000
    
  4. Apply the QoS policy:

    • To a new network:

      neutron net-create <network-name> --qos-policy bw-limiter
      
    • To an existing network:

      neutron net-update test --qos-policy bw-limiter
      
    • To a new port:

      neutron port-create test --name sriov_port --binding:vnic_type direct \
      --qos-policy bw-limiter
      
    • To an existing port:

      neutron port-update sriov_port --qos-policy bw-limiter
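
The neutron client commands above are deprecated in later OpenStack releases. Depending on the client versions available in your environment, the same operations can also be performed with the openstack client; a minimal sketch, assuming a sufficiently recent python-openstackclient:

openstack network qos policy create bw-limiter
openstack network qos rule create --type bandwidth-limit --max-kbps 3000000 bw-limiter
openstack network set --qos-policy bw-limiter <network-name>
openstack port set --qos-policy bw-limiter sriov_port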
      
Applying changes to a QoS policy

You can update or remove an existing QoS policy.

To update a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Update the QoS policy using the neutron qos-bandwidth-limit-rule-update <qos-policy-name> command.

    Example:

    rule_id=`neutron qos-bandwidth-limit-rule-list bw-limiter -f value -c id`
    neutron qos-bandwidth-limit-rule-update bw-limiter $rule_id --max-kbps 200000
    

To remove a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Remove from:

    • A network:

      neutron net-update <network-name> --no-qos-policy
      
    • A port:

      neutron port-update <sriov_port> --no-qos-policy
      

Enable network trunking

The Mirantis Cloud Platform supports port trunking, which enables you to attach a virtual machine to multiple Neutron networks using VLANs as a local encapsulation to differentiate traffic for each network as it goes in and out of a single virtual machine network interface (VIF).

Using network trunking is particularly beneficial in the following use cases:

  • Some applications require connection to hundreds of Neutron networks. To achieve this, you may want to use a single or a few VIFs and VLANs to differentiate traffic for each network rather than having hundreds of VIFs per VM.
  • Cloud workloads are often very dynamic. You may prefer to add or remove VLANs rather than to hotplug interfaces in a virtual machine.
  • Moving a virtual machine from one network to another without detaching the VIF from the virtual machine.
  • A virtual machine may run many containers. Each container may have requirements to be connected to different Neutron networks. Assigning a VLAN or other encapsulation ID for each container is more efficient and scalable than requiring a vNIC per container.
  • Some legacy applications require VLANs to connect to multiple networks.

The current limitation of network trunking support is that MCP supports only Neutron OVS with DPDK and the Open vSwitch firewall driver enabled. Other Neutron ML2 plugins, such as Linux Bridge and OVN, are not supported. If you use security groups and network trunking, MCP automatically enables the native Open vSwitch firewall driver.

To enable network trunking:

  1. Log in to the Salt Master node.

  2. Open the cluster.<NAME>.openstack.init.yml file for editing.

  3. Set the neutron_enable_vlan_aware_vms parameter to True:

    parameters:
      _param:
        neutron_enable_vlan_aware_vms: True
        ...
    
  4. Re-run Salt configuration:

    salt -C 'I@neutron:server' state.sls neutron
    salt -C 'I@neutron:gateway' state.sls neutron
    salt -C 'I@neutron:compute' state.sls neutron
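
After the states have been applied, the trunk API becomes available. The following is a minimal usage sketch from an OpenStack controller node, assuming a parent port and a child port already exist (the names are placeholders):

openstack network trunk create --parent-port <parent-port> trunk1
openstack network trunk set --subport port=<child-port>,segmentation-type=vlan,segmentation-id=100 trunk1
openstack network trunk show trunk1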
    

Enable L2 Gateway support

The L2 Gateway (L2GW) plugin for the Neutron service provides the ability to interconnect a given tenant network with a VLAN on a physical switch. The basic components of L2GW include:

  • L2GW Service plugin

    Residing on a controller node, the L2GW Service plugin notifies the L2GW agent and normal L2 OVS agents running on compute hosts about network events and distributes the VTEP IP address information between them.

  • L2GW agent

    Running on a network node, the L2GW agent is responsible for connecting to the OVSDB server running on a hardware switch and updating the database based on instructions received from the L2GW service plugin.

Before you proceed with the L2GW enablement, verify that the following requirements are met:

  • OVSDB Hardware VTEP physical switch enabled
  • L2 population mechanism driver enabled

To enable L2GW support:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<cluster_name>/openstack/control.yml file of your Reclass model, configure the OpenStack controller nodes by including the service.neutron.control.services.l2gw class.

  3. In the classes/cluster/<cluster_name>/openstack/gateway.yml file of your Reclass model, add the Neutron L2GW agent configuration. For example:

    neutron:
      gateway:
        l2gw:
          enabled: true
          debug: true
          ovsdb_hosts:
            ovsdb1: 10.164.5.253:6622
            ovsdb2: 10.164.5.254:6622
    

    Note

    ovsdb{1,2}

    A user-defined identifier of a physical switch; this name is used in the OpenStack database to identify the switch.

  4. Apply the neutron state to the server nodes to install the service plugin packages, enable the L2GW service plugin, and update the Neutron database with the new schema:

    salt -I 'neutron:server' state.sls neutron -b 1
    
  5. Apply the neutron state to the gateway nodes to install the L2GW agent packages and configure the OVSDB parameters that include a switch pointer with the IP address and port:

    salt -I 'neutron:gateway' state.sls neutron
    
  6. Verify that the L2GW Neutron service plugin is enabled in your deployment:

    1. Log in to one of the OpenStack controller nodes.

    2. Verify that the following command is executed without errors:

      neutron l2-gateway-list
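
Once the plugin and agent are running, you can define a gateway and bind it to a tenant network. The following sketch is based on the networking-l2gw client and is an assumption rather than a verified command set; the exact option syntax may differ between releases, and the device and interface names are placeholders for your switch configuration:

neutron l2-gateway-create --device name=ovsdb1,interface_names=<switch-port> gw1
neutron l2-gateway-connection-create gw1 <network-name> --default-segmentation-id 100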
      

Configure BGP VPN

The Mirantis Cloud Platform (MCP) supports the Neutron Border Gateway Protocol (BGP) VPN Interconnection service. The BGP-based IP VPNs are commonly used in the industry, mainly for enterprises.

You can use the BGP VPN Interconnection service in the following typical use case: a tenant has a BGP IP VPN (a set of external sites) already set up outside the data center and wants to be able to trigger the establishment of a connection between VMs and these VPN external sites.

Enable the BGP VPN Interconnection service

If you have an existing BGP IP VPN (a set of external sites) set up outside the data center, you can enable the BGP VPN Interconnection service in MCP to be able to trigger the establishment of connectivity between VMs and these VPN external sites.

The drivers for the BGP VPN Interconnection service include:

  • OVS/BaGPipe driver
  • OpenContrail driver
  • OpenDaylight driver

To enable the BGP VPN Interconnection service:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster_name>.openstack.init.yml file for editing.

  3. Set the neutron_enable_bgp_vpn parameter to True.

  4. Set the neutron_bgp_vpn_driver parameter to one of the following values: bagpipe, opendaylight, or opencontrail. For example:

    parameters:
      _param:
        neutron_enable_bgp_vpn: True
        neutron_bgp_vpn_driver: bagpipe
        ...
    
  5. Re-apply the Salt configuration:

    salt -C 'I@neutron:server' state.sls neutron
    

For the OpenContrail and OpenDaylight drivers, we assume that the related SDN controllers are already enabled in your MCP cluster. To configure the BaGPipe driver, see Configure the BaGPipe driver for BGP VPN.

Configure the BaGPipe driver for BGP VPN

The BaGPipe driver is a lightweight implementation of the BGP-based VPNs used as a reference back end for the Neutron BGP VPN Interconnection service.

For the instruction below, we assume that the Neutron BGP VPN Interconnection service is already enabled on the OpenStack controller nodes. To enable BGP VPN, see Enable the BGP VPN Interconnection service.

To configure the BaGPipe driver:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster_name>.openstack.compute.yml file for editing.

  3. Add the following parameters:

    parameters:
      ...
      neutron:
        compute:
          bgp_vpn:
            enabled: True
            driver: bagpipe
            bagpipe:
              local_address: <IP address used for BGP peerings>
              peers: <IP addresses of BGP peers>
              autonomous_system: <BGP Autonomous System Number>
              enable_rtc: True # Enable RT Constraint (RFC4684)
          backend:
            extension:
              bagpipe_bgpvpn:
                enabled: True
    
  4. Re-apply the Salt configuration:

    salt -C 'I@neutron:compute' state.sls neutron
    

Note

If BaGPipe is to be enabled on several compute nodes, set up a BGP Route Reflector to interconnect those BaGPipe instances. For more information, see BGP and Route Reflection.
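
Once the service is enabled, BGP VPNs are managed through the Neutron BGP VPN API extension. The following is a minimal usage sketch, assuming the BGP VPN client extension is installed; the route target and names are placeholders, and the exact commands may vary between releases:

neutron bgpvpn-create --route-targets 64512:1 --name my-bgpvpn
neutron bgpvpn-net-assoc-create my-bgpvpn --network <network-name>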

Enable the Networking NW-ODL ML2 plugin

This section explains how to enable the Networking OpenDaylight (NW-ODL) Modular Layer 2 (ML2) plugin for Neutron in your deployment using the Neutron Salt formula, which installs the networking-odl package and enables Neutron to connect to the OpenDaylight controller.

Note

The procedure assumes that the OpenDaylight controller is already up and running.

To enable the NW-ODL ML2 plugin:

  1. Log in to the Salt Master node.

  2. Define the OpenDaylight plugin options in the cluster/<cluster_name>/openstack/init.yml file as follows:

    _param:
      opendaylight_service_host: <ODL_controller_IP>
      opendaylight_router: odl-router_v2 # default
      opendaylight_driver: opendaylight_v2 # default
      provider_mappings: physnet1:br-floating # default
    
  3. In the cluster/<cluster_name>/openstack/control.yml file of your Reclass model, configure the Neutron server by including the system.neutron.control.opendaylight.cluster class and setting credentials and port of OpenDaylight REST API. For example:

    classes:
    - system.neutron.control.opendaylight.cluster
    parameters:
      neutron:
        server:
          backend:
            rest_api_port: 8282
            user: admin
            password: admin
    
  4. In the cluster/<cluster_name>/openstack/gateway.yml file of your Reclass model, include the following class:

    classes:
    - service.neutron.gateway.opendaylight.single
    
  5. In the classes/cluster/<cluster_name>/openstack/compute.yml file of your Reclass model, include the following class:

    classes:
    - service.neutron.compute.opendaylight.single
    
  6. Apply the configuration changes by executing the neutron state on all nodes with neutron:server, neutron:gateway, and neutron:compute roles:

    salt -I 'neutron:server' state.sls neutron
    salt -I 'neutron:gateway' state.sls neutron
    salt -I 'neutron:compute' state.sls neutron
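
To confirm that the Neutron configuration now points at OpenDaylight, you can inspect the rendered configuration files. A minimal verification sketch from the Salt Master node, assuming the standard Neutron file locations:

salt -I 'neutron:server' cmd.run 'grep -E "mechanism_drivers|service_plugins" /etc/neutron/neutron.conf /etc/neutron/plugins/ml2/ml2_conf.ini'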
    

Enable Cross-AZ high availability for Neutron agents

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

The Mirantis Cloud Platform (MCP) supports HA with availability zones for Neutron. An availability zone is defined as an agent attribute on a network node and groups network nodes that run services like DHCP, L3, and others. You can associate an availability zone with resources to ensure that those resources are highly available.

Availability zones provide an extra layer of protection by segmenting the Neutron service deployment in isolated failure domains. By deploying HA nodes across different availability zones, the network services remain available in case of zone-wide failures that affect the deployment. For details, see OpenStack documentation.

Enable Cross-AZ high availability for DHCP

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable Cross-AZ high availability for DHCP. As a result, DHCP services will be created in availability zones selected during the network creation.

To enable Cross-AZ high availability for DHCP:

  1. Log in to the Salt Master node.

  2. In /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/control.yml, set the following parameters:

    parameters:
      neutron:
        server:
          dhcp_agents_per_network: '2'
          dhcp_load_type: 'networks'
          network_scheduler_driver: neutron.scheduler.dhcp_agent_scheduler.AZAwareWeightScheduler
    
  3. In /srv/salt/reclass/classes/cluster/<cluster_name>/infra/config/nodes.yml, set the availability_zone parameter for each network or gateway node as required:

    parameters:
      reclass:
        storage:
          node:
            ...
            openstack_gateway_node<id>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
            openstack_gateway_node<id+1>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
           ...
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.sls reclass.storage
    salt -C 'I@neutron:server or I@neutron:gateway' saltutil.refresh_pillar
    salt -C 'I@neutron:server or I@neutron:gateway' state.sls neutron
    
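
After the change is applied, DHCP availability zones are selected through hints passed at network creation time. A minimal example, assuming availability zones named az1 and az2 exist and using an illustrative network name test-net (run it from any node with the OpenStack client and admin credentials, for example through the keystonercv3 file on ctl01):

openstack network create --availability-zone-hint az1 \
    --availability-zone-hint az2 test-net
openstack network show test-net -c availability_zones -c availability_zone_hints
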
Enable Cross-AZ high availability for L3 router

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable Cross-AZ high availability for an L3 router. As a result, the L3 router services will be created in availability zones selected during the router creation.

To enable Cross-AZ high availability for L3 routers:

  1. Log in to the Salt Master node.

  2. In /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/control.yml, set the following parameters:

    parameters:
      neutron:
        server:
          router_scheduler_driver: neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler
          max_l3_agents_per_router: '3'
    
  3. In /srv/salt/reclass/classes/cluster/<cluster_name>/infra/config/nodes.yml, set the availability_zone parameter for each network or gateway node as required:

    parameters:
      reclass:
        storage:
          node:
            ...
            openstack_gateway_node<id>:
              parameters:
               neutron:
                  gateway:
                    availability_zone: <az-name>
            openstack_gateway_node<id+1>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
           ...
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.sls reclass.storage
    salt -C 'I@neutron:server or I@neutron:gateway' saltutil.refresh_pillar
    salt -C 'I@neutron:server or I@neutron:gateway' state.sls neutron
    

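As with DHCP, the target availability zones for an L3 router are selected through hints passed at router creation time. A minimal example, assuming availability zones named az1 and az2 exist and using an illustrative router name test-router:

openstack router create --availability-zone-hint az1 \
    --availability-zone-hint az2 test-router
openstack router show test-router -c availability_zones -c availability_zone_hints
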
Ironic operations

Ironic is an administrators-only service: access to all API requests is granted only to the OpenStack users with the admin or baremetal_admin roles. However, some read-only operations are also available to the users with the baremetal_observer role.

In MCP, Ironic has not been integrated with the OpenStack Dashboard service yet. To manage and use Ironic, perform any required actions either through the Bare Metal service command-line client using the ironic or openstack baremetal commands, from scripts using the ironicclient Python API, or through direct REST API interactions.

Managing and using Ironic includes creating suitable images, enrolling bare metal nodes into Ironic and configuring them appropriately, and adding compute flavors that correspond to the available bare metal nodes.

Prepare images for Ironic

To provision bare metal servers using Ironic, you need to create special images and upload them to Glance.

The configuration of these images strongly depends on the actual hardware. Therefore, they cannot be provided as pre-built images, and you must prepare them after you deploy Ironic.

These images include:

  • Deploy image that runs the ironic-python-agent required for the deployment and control of bare metal nodes
  • User image based on the hardware used in your non-virtualized environment

Note

This section explains how to create the required images using the diskimage-builder tool.

Prepare deploy images

A deploy image is the image that the bare metal node is PXE-booted into during the image provisioning or node cleaning. It resides in the node’s RAM and has a special agent running that the ironic-conductor service communicates with to orchestrate the image provisioning and node cleaning.

Such images must contain drivers for all network interfaces and disks of the bare metal server.

Note

This section provides example instructions on how to prepare the required images using the diskimage-builder tool. The steps may differ depending on your specific needs and the builder tool. For more information, see Building or downloading a deploy ramdisk image.

To prepare deploy images:

  1. Create the required image by typing:

    diskimage-create <BASE-OS> ironic-agent
    
  2. Upload the resulting *.kernel and *.initramfs images to Glance as aki and ari images:

    1. To upload an aki image, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format aki \
           --container-format aki \
           --file <PATH_TO_IMAGE_KERNEL>
      
    2. To upload an ari image, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format ari \
           --container-format ari \
           --file <PATH_TO_IMAGE_INITRAMFS>
      
Prepare user images

Ironic understands two types of user images:

  • Whole disk image

    Image of complete operating system with the partition table and partitions

  • Partition image

    Image of root partition only, without the partition table. Such images must have appropriate kernel and initramfs images associated with them.

    The partition images can be deployed using one of the following methods:

    • netboot (default)

      The node is PXE-booted over the network; the kernel and ramdisk are served over TFTP.

    • local boot

      During a deployment, the image is modified on a disk to boot from a local disk. See Ironic Advanced features for details.

User images are deployed in a non-virtualized environment on real hardware servers. Therefore, they must include all drivers necessary for the given bare metal server hardware, such as disk and NIC drivers.

Note

This section provides example instructions on how to prepare the required images using the diskimage-builder tool. The steps may differ depending on your specific needs and the builder tool. For more information, see Create and add images to the Image service.

To prepare whole disk images:

Use standard cloud images as whole disk images if they contain all necessary drivers. Otherwise, rebuild a cloud image by typing:

diskimage-create <BASE_SYSTEM> -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]

To prepare partition images for netboot:

  1. Use the images from UEC cloud images that have kernel and initramfs as separate images if they contain all the required drivers.

  2. If additional drivers are required, rebuild the standard whole disk cloud image adding the packages as follows:

    diskimage-create <BASE_SYSTEM> baremetal -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]
    
  3. Upload images to Glance in the following formats:

    • For an aki image for kernel, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format aki \
           --container-format aki \
           --file <PATH_TO_IMAGE_KERNEL>
      
    • For an ari image for initramfs, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format ari \
           --container-format ari \
           --file <PATH_TO_IMAGE_INITRAMFS>
      
    • For a rootfs or whole disk image in the output format (qcow2 by default) specified during rebuild, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format <'QCOW2'_FROM_THE_ABOVE_COMMAND> \
           --container-format <'BARE'_FROM_THE_ABOVE_COMMAND> \
           --kernel-id <UUID_OF_UPLOADED_AKI_IMAGE> \
           --ramdisk-id <UUID_OF_UPLOADED_ARI_IMAGE> \
           --file <PATH_TO_ROOTFS_OR_WHOLE_DISK_IMAGE>
      

      Note

      For rootfs images, set the kernel_id and ramdisk_id image properties to UUIDs of the uploaded aki and ari images respectively.

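If a rootfs image has already been uploaded without the kernel_id and ramdisk_id properties mentioned in the note above, you can set them afterwards. A minimal sketch using the OpenStack client:

openstack image set \
    --property kernel_id=<UUID_OF_UPLOADED_AKI_IMAGE> \
    --property ramdisk_id=<UUID_OF_UPLOADED_ARI_IMAGE> \
    <ROOTFS_IMAGE_NAME_OR_ID>
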
To prepare partition images for local boot:

  1. Use the images from UEC cloud images that have kernel and initramfs as separate images if they contain all the required drivers.

  2. If additional drivers are required, rebuild the standard whole disk cloud image adding the packages as follows:

    Caution

    Verify that the base operating system has the grub2 package available for installation, and enable it during the rebuild as illustrated in the command below.

    diskimage-create <BASE_SYSTEM> baremetal grub2 -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]
    

Add bare metal nodes

This section describes the main steps to enroll a bare metal node to Ironic and make it available for provisioning.

To enroll and configure bare metal nodes:

  1. Enroll new nodes to Ironic using the ironic node-create command:

    ironic node-create \
        --name <node-name> \
        --driver <driver-name> \
        --driver-info deploy_ramdisk=<glance UUID of deploy image ramdisk> \
        --driver-info deploy_kernel=<glance UUID of deploy image kernel> \
        --driver-info ipmi_address=<IPMI address of the node> \
        --driver-info ipmi_username=<username for IPMI> \
        --driver-info ipmi_password=<password for the IPMI user> \
        --property memory_mb=<RAM size of the node in MiB> \
        --property cpus=<Number of CPUs on the node> \
        --property local_gb=<size of node's disk in GiB> \
        --property cpu_arch=<architecture of node's CPU>
    

    Where the local_gb property is the size of the biggest disk of the node. We recommend setting it 1 GB smaller than the actual disk size to accommodate the partition table and the extra configuration drive partition.

  2. Add ports for the node that correspond to the actual NICs of the node:

    ironic port-create --node <UUID_OF_IRONIC_NODE> --address <MAC_ADDRESS>
    

    Note

    At least one port for the node must be created for the NIC that is attached to the provisioning network and from which the node can boot over PXE.

  3. Alternatively, enroll the nodes by adding them to the Reclass model on the cluster level:

    parameters:
      ironic:
        client:
          enabled: true
          nodes:
            admin_identity:
              - name: <node-name>
                driver: pxe_ipmitool
                properties:
                  local_gb: <size of node's disk in GiB>
                  cpus: <Number of CPUs on the node>
                  memory_mb: <RAM size of the node in MiB>
                  cpu_arch: <architecture of node's CPU>
                driver_info:
                  ipmi_username: <username for IPMI>
                  ipmi_password: <password for the IPMI user>
                  ipmi_address: <IPMI address of the node>
                ports:
                  - address: <MAC address of the node port1>
                  - address: <MAC address of the node port2>
    

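Optionally, verify the enrollment before proceeding. A minimal check using the Bare Metal client:

ironic node-list
ironic node-validate <node-name>

The node-validate output indicates whether the power and deploy interfaces of the node are configured correctly.
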
Create compute flavors

Appropriately created compute flavors allow the Compute service to properly schedule workloads to bare metal nodes.

To create nova flavors:

  1. Create a flavor using the nova flavor-create command:

    nova flavor-create <FLAVOR_NAME> <UUID_OR_'auto'> <RAM> <DISK> <CPUS>
    

    Where RAM, DISK, and CPUS equal the corresponding properties set on the bare metal nodes.

  2. Use the above command to create flavors for each type of bare metal nodes you need to differentiate.

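For example, a flavor matching bare metal nodes with 8 CPUs, 16384 MiB of RAM, and a 100 GB disk could look as follows; the flavor name bm.large is illustrative, and the optional cpu_arch extra spec mirrors the corresponding node property on clouds that filter on it:

nova flavor-create bm.large auto 16384 100 8
nova flavor-key bm.large set cpu_arch=x86_64
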
Provision instances

After Ironic nodes, ports, and flavors have been successfully configured, deploy instances to the bare metal nodes using the nova boot command:

nova boot <server name> \
    --image <IMAGE_NAME_OR_ID> \
    --flavor <BAREMETAL_FLAVOR_NAME_OR_ID> \
    --nic net-id=<ID_OF_SHARED_BAREMETAL_NETWORK>

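You can monitor the deployment progress from the Bare Metal service side. A minimal check using the Bare Metal and Compute clients:

ironic node-list
nova list

The provision_state of the target node changes to active and the instance becomes ACTIVE once the provisioning completes.
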
Enable SSL on Ironic internal API

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can enable SSL for all OpenStack components while generating a deployment metadata model using the Model Designer UI before deploying a new OpenStack environment. You can also enable SSL on Ironic internal API on an existing OpenStack environment.

The example instruction below describes the following Ironic configuration:

  • The OpenStack Ironic API service runs on the OpenStack ctl nodes.
  • The OpenStack Ironic deploy API and conductor services run on the bmt nodes.

You may need to modify this example configuration depending on the needs of your deployment.

To enable SSL on Ironic internal API on an existing MCP cluster:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Modify ./openstack/baremetal.yml as follows:

    classes:
    - system.salt.minion.cert.openstack_api
    - system.apache.server.proxy
    - system.apache.server.proxy.openstack.ironic
    parameters:
      _param:
        apache_proxy_openstack_api_address: ${_param:cluster_baremetal_local_address}
        apache_proxy_openstack_api_host: ${_param:cluster_baremetal_local_address}
        ironic_conductor_api_url_protocol: https
        openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_baremetal_local_address},IP:${_param:cluster_baremetal_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_baremetal_local_address},DNS:${_param:cluster_baremetal_vip_address}
        apache_ssl:
          enabled: true
          authority: "${_param:salt_minion_ca_authority}"
          key_file: ${_param:openstack_api_cert_key_file}
          cert_file: ${_param:openstack_api_cert_cert_file}
          chain_file: ${_param:openstack_api_cert_all_file}
        apache_proxy_openstack_ironic_host: 127.0.0.1
        haproxy_https_check_options:
        - httpchk GET /
        - httpclose
        - tcplog
        haproxy_ironic_deploy_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      haproxy:
        proxy:
          listen:
            ironic_deploy:
              type: None
              mode: tcp
              options: ${_param:haproxy_https_check_options}
      ironic:
        api:
          bind:
            address: 127.0.0.1
    
  3. Modify ./openstack/control.yml as follows:

    classes:
    - system.apache.server.proxy.openstack.ironic
    parameters:
      _param:
        apache_proxy_openstack_ironic_host: 127.0.0.1
        haproxy_ironic_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      haproxy:
        proxy:
          listen:
            ironic:
              type: None
              mode: tcp
              options: ${_param:haproxy_https_check_options}
      ironic:
        api:
          bind:
            address: 127.0.0.1
    
  4. Modify ./openstack/control/init.yml as follows:

    parameters:
      _param:
        ironic_service_protocol: ${_param:cluster_internal_protocol}
    
  5. Modify ./openstack/init.yml as follows:

    parameters:
      _param:
        ironic_service_host: ${_param:openstack_service_host}
        ironic_service_protocol: ${_param:cluster_internal_protocol}
    
  6. Modify ./openstack/proxy.yml as follows:

    parameters:
      _param:
        nginx_proxy_openstack_ironic_protocol: https
    
  7. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  8. Apply the following Salt states:

    salt 'bmt*' state.apply salt
    salt -C 'I@ironic:api' state.apply apache
    salt 'prx*' state.apply nginx
    salt -C 'I@ironic:api' state.apply haproxy
    salt -C 'I@ironic:api' state.apply ironic
    
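
Optionally, verify that the Ironic API is no longer exposed directly and that TLS is terminated by Apache. A minimal check, assuming the default Ironic API port 6385:

salt -C 'I@ironic:api' cmd.run "ss -tlnp | grep 6385"

The ironic-api process should be bound to 127.0.0.1 only, with Apache listening on the cluster address in front of it.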

Enable the networking-generic-switch driver

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Note

This feature is available as technical preview. Use such configuration for testing and evaluation purposes only.

The networking-generic-switch ML2 mechanism driver in Neutron implements the features required for multitenancy support on the Ironic bare metal nodes. This driver requires the corresponding configuration of the Neutron server service.

To enable the networking-generic-switch driver:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In openstack/control.yml, add pillars for networking-generic-switch using the example below:

    parameters:
      ...
      neutron:
        server:
          backend:
            mechanism:
              ngs:
                driver: genericswitch
          n_g_s:
            enabled: true
            coordination: # optional
              enabled: true
              backend_url: "etcd3+http://1.2.3.4:2379"
            devices:
              s1brbm:
                options:
                  device_type:
                    value: netmiko_ovs_linux
                  ip:
                    value: 1.2.3.4
                  username:
                    value: ngs_ovs_manager
                  password:
                    value: password
    
  4. Apply the new configuration for the Neutron server:

    salt -C 'I@neutron:server' saltutil.refresh_pillar
    salt -C 'I@neutron:server' state.apply neutron.server
    
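
Optionally, verify that the Neutron server restarted successfully with the new mechanism driver:

salt -C 'I@neutron:server' service.status neutron-server

If the service is down, inspect the Neutron server log, typically /var/log/neutron/neutron-server.log, on the affected node for errors related to the genericswitch mechanism driver.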

Troubleshoot Ironic

The most typical Ironic failures are caused by the following peculiarities of the service design:

  • Ironic is sensitive to a possible time difference between the nodes that host the ironic-api and ironic-conductor services.

    One of the symptoms of time being out of sync is the inability to enroll a bare metal node into Ironic with the error message No conductor service registered which supports driver <DRIVER_NAME>, although the <DRIVER_NAME> driver is known to be enabled and is shown in the output of the ironic driver-list command.

    To fix the issue, verify that the time is properly synced between the nodes. See the example check at the end of this list.

  • Ironic requires the IPMI access credentials for the nodes to have the admin privilege level. Any lower privilege level, for example engineer, prevents Ironic from functioning properly.

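To check for the time skew mentioned in the first item above, compare the clocks of the involved nodes. A minimal sketch, assuming the ironic:api and ironic:conductor pillar keys are present on the corresponding nodes:

salt -C 'I@ironic:api or I@ironic:conductor' cmd.run 'date -u +%s'

The returned epoch timestamps should differ by no more than a few seconds. Otherwise, fix the NTP configuration on the affected nodes.
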
Designate operations

After you deploy an MCP cluster that includes Designate, you can start creating DNS zones and zone records as well as configure auto-generation of records in DNS zones.

Create a DNS zone and record

This section describes how to create a DNS zone and a record in the created DNS zone on the MCP cluster where Designate is deployed.

To create a DNS zone and record:

  1. Log in to the Salt Master node.

  2. Create a test DNS zone called testdomain.tld. by running the following command against one of the controller nodes where Designate is deployed. For example, ctl01.

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack zone create \
    --email dnsmaster@testdomain.tld testdomain.tld."
    

    Once the change is applied to one controller node, the updated distributed database replicates this change between all controller nodes.

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +----------------+--------------------------------------+
     | Field          | Value                                |
     +----------------+--------------------------------------+
     | action         | CREATE                               |
     | attributes     |                                      |
     | created_at     | 2017-08-01T12:25:33.000000           |
     | description    | None                                 |
     | email          | dnsmaster@testdomain.tld             |
     | id             | ce9836a9-ba78-4960-9c89-6a4989a9e095 |
     | masters        |                                      |
     | name           | testdomain.tld.                      |
     | pool_id        | 794ccc2c-d751-44fe-b57f-8894c9f5c842 |
     | project_id     | 49c11a3aa9534d8b897cf06890871840     |
     | serial         | 1501590333                           |
     | status         | PENDING                              |
     | transferred_at | None                                 |
     | ttl            | 3600                                 |
     | type           | PRIMARY                              |
     | updated_at     | None                                 |
     | version        | 1                                    |
     +----------------+--------------------------------------+
    
  3. Verify that a DNS zone is successfully created and is in the ACTIVE status:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack zone list"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +------------------------------------+---------------+-------+-----------+------+------+
     |id                                  |name           |type   |serial     |status|action|
     +------------------------------------+---------------+-------+-----------+------+------+
     |571243e5-17dd-49bd-af09-de6b0c175d8c|example.tld.   |PRIMARY| 1497877051|ACTIVE|NONE  |
     |7043de84-3a40-4b44-ad4c-94dd1e802370|domain.tld.    |PRIMARY| 1498209223|ACTIVE|NONE  |
     |ce9836a9-ba78-4960-9c89-6a4989a9e095|testdomain.tld.|PRIMARY| 1501590333|ACTIVE|NONE  |
     +------------------------------------+---------------+-------+-----------+------+------+
    
  4. Create a record in the new DNS zone by running the command below. Use any IPv4 address to test that it works. For example, 192.168.0.1.

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset create \
    --records '192.168.0.1' --type A testdomain.tld. tstserver01"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +-------------+--------------------------------------+
     | Field       | Value                                |
     +-------------+--------------------------------------+
     | action      | CREATE                               |
     | created_at  | 2017-08-01T12:28:37.000000           |
     | description | None                                 |
     | id          | d099f013-460b-41ee-8cf1-3cf0e3c49bc7 |
     | name        | tstserver01.testdomain.tld.         |
     | project_id  | 49c11a3aa9534d8b897cf06890871840     |
     | records     | 192.168.0.1                          |
     | status      | PENDING                              |
     | ttl         | None                                 |
     | type        | A                                    |
     | updated_at  | None                                 |
     | version     | 1                                    |
     | zone_id     | ce9836a9-ba78-4960-9c89-6a4989a9e095 |
     | zone_name   | testdomain.tld.                      |
     +-------------+--------------------------------------+
    
  5. Verify that the record is successfully created and is in the ACTIVE status by running the openstack recordset list [zone_id] command. The zone_id parameter can be found in the output of the command described in the previous step.

    Example:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset list \
    ce9836a9-ba78-4960-9c89-6a4989a9e095"
    
    ctl01.virtual-mcp-ocata-ovs.local:
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    | id| name                      |type|records                                                   |status|action|
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    |...|testdomain.tld.            |SOA |ns1.example.org. dnsmaster.testdomain.tld. 1501590517 3598|ACTIVE|NONE  |
    |...|testdomain.tld.            |NS  |ns1.example.org.                                          |ACTIVE|NONE  |
    |...|tstserver01.testdomain.tld.|A   |192.168.0.1                                               |ACTIVE|NONE  |
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    
  6. Verify that the DNS record can be resolved by running the nslookup tstserver01.testdomain.tld [dns_server_address] command. In the example below, the DNS server address of the Designate back end is 10.0.0.1.

    Example:

    nslookup tstserver01.testdomain.tld 10.0.0.1
    
    Server:     10.0.0.1
    Address:    10.0.0.1#53
    Name:   tstserver01.testdomain.tld
    Address: 192.168.0.1
    

Configure auto-generation of records in a DNS zone

After you create a DNS zone and a record for this zone as described in Create a DNS zone and record, you can configure auto-generation of records in the created DNS zone.

To configure auto-generation of records in the created DNS zone:

  1. In your Git project repository, change the directory to classes/cluster/<cluster_name>/openstack/.

  2. In init.yml, set the designate_domain_id parameter according to the created DNS zone. For example:

    designate_domain_id: ce9836a9-ba78-4960-9c89-6a4989a9e095
    
  3. Refresh pillars on the Salt Minion nodes:

    salt '*' saltutil.refresh_pillar
    
  4. Apply the Designate states:

    salt -C 'I@designate:server and *01*' state.sls designate.server
    salt -C 'I@designate:server' state.sls designate
    
  5. Using the Nova CLI, boot the VM for which you have created the DNS zone.

  6. Verify that the DNS record related to the VM was created by running the salt 'ctl01*' cmd.run "openstack recordset list [zone_id]" command. For example:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset list \
    ce9836a9-ba78-4960-9c89-6a4989a9e095"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
    +------------------------------------+---------------------------+----+-----------+------+------+
    |id                                  |name                       |type|records    |status|action|
    +------------------------------------+---------------------------+----+-----------+------+------+
    |d099f013-460b-41ee-8cf1-3cf0e3c49bc7|tstserver01.testdomain.tld.|A   |192.168.0.1|ACTIVE|NONE  |
    +------------------------------------+---------------------------+----+-----------+------+------+
    

Ceph operations

Ceph is a storage back end for cloud environments. After you successfully deploy a Ceph cluster, you can manage its nodes and object storage daemons (Ceph OSDs). This section describes how to add Ceph Monitor, Ceph OSD, and RADOS Gateway nodes to an existing Ceph cluster or remove them, as well as how to remove or replace Ceph OSDs.

Prerequisites

Before you proceed to manage Ceph nodes and OSDs, or upgrade Ceph, perform the steps below.

  1. Verify that your Ceph cluster is up and running.

  2. Log in to the Salt Master node.

  3. Add Ceph pipelines to DriveTrain.

    1. Add the following class to the cluster/cicd/control/leader.yml file:

      classes:
      - system.jenkins.client.job.ceph
      
    2. Apply the salt -C 'I@jenkins:client' state.sls jenkins.client state.

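For the first prerequisite, a quick way to check the cluster state is to query a node that holds the admin keyring, for example cmn01 as used elsewhere in this guide:

salt 'cmn01*' cmd.run "ceph -s"

The output should report HEALTH_OK, or a HEALTH_WARN state that you can explain, before you proceed with any node or OSD operations.
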
Manage Ceph nodes

This section describes how to add Ceph Monitor, Ceph OSD, and RADOS Gateway nodes to an existing Ceph cluster or remove them.

Add a Ceph Monitor node

This section describes how to add a Ceph Monitor node to an existing Ceph cluster.

Warning

Prior to the 2019.2.10 maintenance update, this feature is available as technical preview only.

Note

The Ceph Monitor service is quorum-based. Therefore, keep an odd number of Ceph Monitor nodes to establish a quorum.

To add a Ceph Monitor node:

  1. In your project repository, add the following lines to the cluster/ceph/init.yml file and modify them according to your environment:

    _param:
       ceph_mon_node04_hostname: cmn04
       ceph_mon_node04_address: 172.16.47.145
       ceph_mon_node04_ceph_public_address: 10.13.0.4
       ceph_mon_node04_deploy_address: 192.168.0.145
    linux:
      network:
        host:
          cmn04:
            address: ${_param:ceph_mon_node04_address}
            names:
            - ${_param:ceph_mon_node04_hostname}
            - ${_param:ceph_mon_node04_hostname}.${_param:cluster_domain}
    

    Note

    Skip the ceph_mon_node04_deploy_address parameter if you have DHCP enabled on a PXE network.

  2. Define the backup configuration for the new node in cluster/ceph/init.yml. For example:

    parameters:
      _param:
        ceph_mon_node04_ceph_backup_hour: 4
        ceph_mon_node04_ceph_backup_minute: 0
    
  3. Add the following lines to the cluster/ceph/common.yml file and modify them according to your environment:

    parameters:
      ceph:
        common:
          members:
            - name: ${_param:ceph_mon_node04_hostname}
              host: ${_param:ceph_mon_node04_address}
    
  4. Add the following lines to the cluster/infra/config/nodes.yml file:

    parameters:
      reclass:
        storage:
          node:
            ceph_mon_node04:
              name: ${_param:ceph_mon_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.mon
              params:
                ceph_public_address: ${_param:ceph_mon_node04_ceph_public_address}
                ceph_backup_time_hour: ${_param:ceph_mon_node04_ceph_backup_hour}
                ceph_backup_time_minute: ${_param:ceph_mon_node04_ceph_backup_minute}
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename: ${_param:ceph_mon_system_codename}
                single_address: ${_param:ceph_mon_node04_address}
                deploy_address: ${_param:ceph_mon_node04_deploy_address}
                keepalived_vip_priority: 104
    

    Note

    Skip the deploy_address parameter if you have DHCP enabled on a PXE network.

  5. Add the following lines to the cluster/infra/kvm.yml file and modify infra_kvm_node03_hostname depending on which KVM node the Ceph Monitor node should run on:

    parameters:
      salt:
        control:
          size:
            ceph.mon:
              cpu: 8
              ram: 16384
              disk_profile: small
              net_profile: default
          cluster:
            internal:
              node:
                cmn04:
                  name: ${_param:ceph_mon_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.mon
    
  6. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  7. Log in to the Jenkins web UI.

  8. Open the Ceph - add node pipeline.

  9. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the new Ceph Monitor node. For example, cmn04*.
    HOST_TYPE Removed since 2019.2.13 update Add mon as the type of Ceph node that is going to be added.
  10. Click Deploy.

The Ceph - add node pipeline workflow:

  1. Launch the Ceph Monitor VMs.
  2. Run the reclass state.
  3. Run the linux, openssh, salt, ntp, rsyslog, ceph.mon states.
  4. Update ceph.conf files on all Ceph nodes.
  5. Run the ceph.mgr state if the pillar is present.
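
Once the pipeline finishes, you can verify that the new Ceph Monitor has joined the quorum, for example from the cmn01 node:

salt 'cmn01*' cmd.run "ceph mon stat"

The output should list cmn04 among the monitors in quorum.
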
Add a Ceph OSD node

This section describes how to add a Ceph OSD node to an existing Ceph cluster.

Warning

Prior to the 2019.2.10 maintenance update, this feature is available as technical preview only.

To add a Ceph OSD node:

  1. Connect the salt-minion service on the new Ceph OSD node to the Salt Master node.

  2. In your project repository, if the nodes are not generated dynamically, add the following lines to cluster/ceph/init.yml and modify according to your environment:

    _param:
       ceph_osd_node05_hostname: osd005
       ceph_osd_node05_address: 172.16.47.72
       ceph_osd_node05_backend_address: 10.12.100.72
       ceph_osd_node05_public_address: 10.13.100.72
       ceph_osd_node05_deploy_address: 192.168.0.72
       ceph_osd_system_codename: xenial
    linux:
      network:
        host:
          osd005:
            address: ${_param:ceph_osd_node05_address}
            names:
            - ${_param:ceph_osd_node05_hostname}
            - ${_param:ceph_osd_node05_hostname}.${_param:cluster_domain}
    

    Note

    Skip the ceph_osd_node05_deploy_address parameter if you have DHCP enabled on a PXE network.

  3. If the nodes are not generated dynamically, add the following lines to the cluster/infra/config/nodes.yml and modify according to your environment. Otherwise, increase the number of generated OSDs.

    parameters:
      reclass:
        storage:
          node:
            ceph_osd_node05:
              name: ${_param:ceph_osd_node05_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.osd
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_osd_system_codename}
                single_address: ${_param:ceph_osd_node05_address}
                deploy_address: ${_param:ceph_osd_node05_deploy_address}
                backend_address: ${_param:ceph_osd_node05_backend_address}
                ceph_public_address: ${_param:ceph_osd_node05_public_address}
                ceph_crush_parent: rack02
    

    Note

    Skip the deploy_address parameter if you have DHCP enabled on a PXE network.

  4. Since 2019.2.3, skip this step. Verify that the cluster/ceph/osd.yml file and the pillar of the new Ceph OSD do not contain the following lines:

    parameters:
      ceph:
        osd:
          crush_update: false
    
  5. Log in to the Jenkins web UI.

  6. Select from the following options:

    • For MCP versions starting from the 2019.2.10 maintenance update, open the Ceph - add osd (upmap) pipeline.
    • For MCP versions prior to the 2019.2.10 maintenance update, open the Ceph - add node pipeline.

    Note

    Prior to the 2019.2.10 maintenance update, the Ceph - add node and Ceph - add osd (upmap) Jenkins pipeline jobs are available as technical preview only.

    Caution

    A large change in the crush weights distribution after the addition of Ceph OSDs can cause massive unexpected rebalancing, affect performance, and in some cases can cause data corruption. Therefore, if you are using Ceph - add node, Mirantis recommends that you add all disks with zero weight and reweight them gradually.

  7. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the new Ceph OSD. For example, osd005*.
    HOST_TYPE Removed since 2019.2.3 update Add osd as the type of Ceph node that is going to be added.
    CLUSTER_FLAGS Added since 2019.2.7 update Add a comma-separated list of flags to check after the pipeline execution.
    USE_UPMAP Added since 2019.2.13 update Use to facilitate the upmap module during rebalancing to minimize impact on cluster performance.
  8. Click Deploy.

    The Ceph - add node pipeline workflow prior to the 2019.2.3 maintenance update:

    1. Apply the reclass state.
    2. Apply the linux, openssh, salt, ntp, rsyslog, ceph.osd states.

    The Ceph - add node pipeline workflow starting from 2019.2.3 maintenance update:

    1. Apply the reclass state.
    2. Verify that all installed Ceph clients have the Luminous version.
    3. Apply the linux, openssh, salt, ntp, rsyslog states.
    4. Set the Ceph cluster compatibility to Luminous.
    5. Switch the balancer module to the upmap mode.
    6. Set the norebalance flag before adding a Ceph OSD.
    7. Apply the ceph.osd state on the selected Ceph OSD node.
    8. Update the mappings for the remapped placement group (PG) using upmap back to the old Ceph OSDs.
    9. Unset the norebalance flag and verify that the cluster is healthy.
  9. If you use a custom CRUSH map, update the CRUSH map:

    1. Verify the updated /etc/ceph/crushmap file on cmn01. If correct, apply the CRUSH map using the following commands:

      crushtool -c /etc/ceph/crushmap -o /etc/ceph/crushmap.compiled
      ceph osd setcrushmap -i /etc/ceph/crushmap.compiled
      
    2. Add the following lines to the cluster/ceph/osd.yml file:

      parameters:
        ceph:
          osd:
            crush_update: false
      
    3. Apply the ceph.osd state to persist the CRUSH map:

      salt -C 'I@ceph:osd' state.sls ceph.osd
      
  10. Integrate the Ceph OSD nodes with StackLight:

    1. Update the Salt mine:

      salt -C 'I@ceph:osd or I@telegraf:remote_agent' state.sls salt.minion.grains
      salt -C 'I@ceph:osd or I@telegraf:remote_agent' saltutil.refresh_modules
      salt -C 'I@ceph:osd or I@telegraf:remote_agent' mine.update
      

      Wait for one minute.

    2. Apply the following states:

      salt -C 'I@ceph:osd or I@telegraf:remote_agent' state.sls telegraf
      salt -C 'I@ceph:osd' state.sls fluentd
      salt 'mon*' state.sls prometheus
      
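
Once the pipeline finishes and rebalancing completes, you can verify that the new Ceph OSDs have joined the cluster, for example from the cmn01 node:

salt 'cmn01*' cmd.run "ceph osd tree"
salt 'cmn01*' cmd.run "ceph -s"

The new Ceph OSDs should appear under the expected CRUSH parent, rack02 in the example above, and the cluster should return to HEALTH_OK.
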
Add a RADOS Gateway node

This section describes how to add a RADOS Gateway (rgw) node to an existing Ceph cluster.

To add a RADOS Gateway node:

  1. In your project repository, add the following lines to the cluster/ceph/init.yml and modify them according to your environment:

    _param:
      ceph_rgw_node04_hostname: rgw04
      ceph_rgw_node04_address: 172.16.47.162
      ceph_rgw_node04_ceph_public_address: 10.13.0.162
      ceph_rgw_node04_deploy_address: 192.168.0.162
    linux:
      network:
        host:
          rgw04:
            address: ${_param:ceph_rgw_node04_address}
            names:
            - ${_param:ceph_rgw_node04_hostname}
            - ${_param:ceph_rgw_node04_hostname}.${_param:cluster_domain}
    

    Note

    Skip the ceph_rgw_node04_deploy_address parameter if you have DHCP enabled on a PXE network.

  2. Add the following lines to the cluster/ceph/rgw.yml file:

    parameters:
      _param:
        cluster_node04_hostname: ${_param:ceph_rgw_node04_hostname}
        cluster_node04_address: ${_param:ceph_rgw_node04_address}
      ceph:
        common:
          keyring:
            rgw.rgw04:
              caps:
                mon: "allow rw"
                osd: "allow rwx"
      haproxy:
        proxy:
          listen:
            radosgw:
              servers:
                - name: ${_param:cluster_node04_hostname}
                  host: ${_param:cluster_node04_address}
                  port: ${_param:haproxy_radosgw_source_port}
                  params: check
    

    Note

    Starting from the MCP 2019.2.10 maintenance update, the capabilities for RADOS Gateway have been restricted. To update the existing capabilities, perform the steps described in Restrict the RADOS Gateway capabilities.

  3. Add the following lines to the cluster/infra/config/init.yml file:

    parameters:
      reclass:
        storage:
          node:
            ceph_rgw_node04:
              name: ${_param:ceph_rgw_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_rgw_system_codename}
                single_address: ${_param:ceph_rgw_node04_address}
                deploy_address: ${_param:ceph_rgw_node04_deploy_address}
                ceph_public_address: ${_param:ceph_rgw_node04_ceph_public_address}
                keepalived_vip_priority: 104
    

    Note

    Skip the deploy_address parameter if you have DHCP enabled on a PXE network.

  4. Add the following lines to the cluster/infra/kvm.yml file and modify infra_kvm_node03_hostname depending on which KVM node the rgw must be running on:

    parameters:
      salt:
        control:
          size:
            ceph.rgw:
              cpu: 8
              ram: 16384
              disk_profile: small
              net_profile: default
          cluster:
            internal:
              node:
                rgw04:
                  name: ${_param:ceph_rgw_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.rgw
    
  5. Log in to the Jenkins web UI.

  6. Open the Ceph - add node pipeline.

  7. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the new RADOS Gateway node. For example, rgw04*.
    HOST_TYPE Removed since 2019.2.13 update Add rgw as the type of Ceph node that is going to be added.
  8. Click Deploy.

The Ceph - add node pipeline workflow:

  1. Launch RADOS Gateway VMs.
  2. Run the reclass state.
  3. Run the linux, openssh, salt, ntp, rsyslog, keepalived, haproxy, ceph.radosgw states.
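
Once the pipeline finishes, you can verify that the new RADOS Gateway daemon is running and registered in the cluster, for example from the cmn01 node:

salt 'cmn01*' cmd.run "ceph -s"

On Luminous and later releases, the services section of the output lists the active rgw daemons.
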
Add a Ceph OSD daemon

This section describes how to add new or re-add the existing Ceph OSD daemons on an existing Ceph OSD node.

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Note

The pipeline used in this section is a wrapper for Ceph - add node, which simplifies common operations.

To add a new or re-add the existing Ceph OSD daemon:

  1. If you are adding a new Ceph OSD daemon, perform the following prerequisite steps. Otherwise, proceed to the next step.

    1. Open your Git project repository with the Reclass model on the cluster level.
    2. In cluster/ceph/osd.yml, add the new Ceph OSD daemon definition. Use the existing definition as a template.
  2. Log in to the Jenkins web UI.

  3. Select from the following options:

    • For MCP version 2019.2.13, open the Ceph - add osd (upmap) Jenkins pipeline job.
    • For MCP versions starting from 2019.2.14, open the Ceph - add osd Jenkins pipeline job.
  4. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST The Salt target name of the host to which the Ceph OSD daemons are going to be added. For example, osd005*.
    CLUSTER_FLAGS A comma-separated list of flags to check after the pipeline execution.
  5. Click Deploy.

    The Ceph - add osd pipeline runs Ceph - add node with the following predefined values:

    • OSD_ONLY is set to True to omit enforcing the node configuration because a working node is already configured in the cluster.
    • USE_UPMAP is set to True to gradually add Ceph OSD daemons to the cluster and prevent consuming excessive I/O for rebalancing.
Remove a Ceph Monitor node

This section describes how to remove a Ceph Monitor node from a Ceph cluster.

Note

The Ceph Monitor service is quorum-based. Therefore, keep an odd number of Ceph Monitor nodes to establish a quorum.

To remove a Ceph Monitor node:

  1. In your project repository, remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

    parameters:
      reclass:
        storage:
          node:
            ceph_mon_node04:
              name: ${_param:ceph_mon_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.mon
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_mon_system_codename}
                single_address: ${_param:ceph_mon_node04_address}
                keepalived_vip_priority: 104
    
  2. Remove the following lines from the cluster/ceph/common.yml file or from the pillar based on your environment:

    parameters:
      ceph:
        common:
          members:
            - name: ${_param:ceph_mon_node04_hostname}
              host: ${_param:ceph_mon_node04_address}
    
  3. Log in to the Jenkins web UI.

  4. Open the Ceph - remove node pipeline.

  5. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the Ceph Monitor node to remove. For example, cmn04*.
    HOST_TYPE Removed since 2019.2.13 update Add mon as the type of Ceph node that is going to be removed.
  6. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Reconfigure the configuration file on all ceph:common minions.
    2. Destroy the VM.
    3. Remove the Salt Minion node ID from salt-key on the Salt Master node.
  7. Remove the following lines from the cluster/infra/kvm.yml file or from the pillar based on your environment:

    parameters:
      salt:
        control:
          cluster:
            internal:
              node:
                cmn04:
                  name: ${_param:ceph_mon_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.mon
    
  8. Remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

    _param:
       ceph_mon_node04_hostname: cmn04
       ceph_mon_node04_address: 172.16.47.145
    linux:
      network:
        host:
          cmn04:
            address: ${_param:ceph_mon_node04_address}
            names:
            - ${_param:ceph_mon_node04_hostname}
            - ${_param:ceph_mon_node04_hostname}.${_param:cluster_domain}
    
Remove a Ceph OSD node

This section describes how to remove a Ceph OSD node from a Ceph cluster.

To remove a Ceph OSD node:

  1. If the host is explicitly defined in the model, perform the following steps. Otherwise, proceed to step 2.

    1. In your project repository, remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

      _param:
         ceph_osd_node05_hostname: osd005
         ceph_osd_node05_address: 172.16.47.72
         ceph_osd_system_codename: xenial
      linux:
        network:
          host:
            osd005:
              address: ${_param:ceph_osd_node05_address}
              names:
              - ${_param:ceph_osd_node05_hostname}
              - ${_param:ceph_osd_node05_hostname}.${_param:cluster_domain}
      
    2. Remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

      parameters:
        reclass:
          storage:
            node:
              ceph_osd_node05:
                name: ${_param:ceph_osd_node05_hostname}
                domain: ${_param:cluster_domain}
                classes:
                - cluster.${_param:cluster_name}.ceph.osd
                params:
                  salt_master_host: ${_param:reclass_config_master}
                  linux_system_codename:  ${_param:ceph_osd_system_codename}
                  single_address: ${_param:ceph_osd_node05_address}
                  ceph_crush_parent: rack02
      
  2. Log in to the Jenkins web UI.

  3. Open the Ceph - remove node pipeline.

  4. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the Ceph OSD node to remove. For example, osd005*.
    HOST_TYPE Removed since 2019.2.13 update Add osd as the type of Ceph node that is going to be removed.
    OSD Added since 2019.2.13 update Specify the list of Ceph OSDs to remove while keeping the rest and the entire node as part of the cluster. To remove all, leave empty or set to *.
    GENERATE_CRUSHMAP Select if the CRUSH map file should be updated. The update must be applied manually unless it is specifically set to be enforced in the pillar.
    ADMIN_HOST Removed since 2019.2.13 update Add cmn01* as the Ceph cluster node with the admin keyring.
    WAIT_FOR_HEALTHY Mandatory since the 2019.2.13 maintenance update. Verify that this parameter is selected as it enables the Ceph health check within the pipeline.
    CLEANDISK Added since 2019.2.10 update Mandatory since the 2019.2.13 maintenance update. Select to clean the data or block partitions.
    CLEAN_ORPHANS Added since 2019.2.13 update Select to clean orphaned disks of Ceph OSDs that are no longer part of the cluster.
    FAST_WIPE Added since 2019.2.13 update Deselect if the entire disk needs zero filling.
  5. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Mark all Ceph OSDs running on the specified HOST as out. If you selected the WAIT_FOR_HEALTHY parameter, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.
    2. Stop all Ceph OSDs services running on the specified HOST.
    3. Remove all Ceph OSDs running on the specified HOST from the CRUSH map.
    4. Remove all Ceph OSD authentication keys running on the specified HOST.
    5. Remove all Ceph OSDs running on the specified HOST from Ceph cluster.
    6. Purge Ceph packages from the specified HOST.
    7. Stop the Salt Minion node on the specified HOST.
    8. Remove all Ceph OSDs running on the specified HOST from Ceph cluster.
    9. Remove the Salt Minion node ID from salt-key on the Salt Master node.
    10. Update the CRUSHMAP file on the I@ceph:setup:crush node if GENERATE_CRUSHMAP was selected. You must manually apply the update unless it is specified otherwise in the pillar.
  6. If you selected GENERATE_CRUSHMAP, check the updated /etc/ceph/crushmap file on cmn01. If it is correct, apply the CRUSH map:

    crushtool -c /etc/ceph/crushmap -o /etc/ceph/crushmap.compiled
    ceph osd setcrushmap -i /etc/ceph/crushmap.compiled
    
Remove a RADOS Gateway node

This section describes how to remove a RADOS Gateway (rgw) node from a Ceph cluster.

To remove a RADOS Gateway node:

  1. In your project repository, remove the following lines from the cluster/ceph/rgw.yml file or from the pillar based on your environment:

    parameters:
      _param:
        cluster_node04_hostname: ${_param:ceph_rgw_node04_hostname}
        cluster_node04_address: ${_param:ceph_rgw_node04_address}
      ceph:
        common:
          keyring:
            rgw.rgw04:
              caps:
                mon: "allow rw"
                osd: "allow rwx"
      haproxy:
        proxy:
          listen:
            radosgw:
              servers:
                - name: ${_param:cluster_node04_hostname}
                  host: ${_param:cluster_node04_address}
                  port: ${_param:haproxy_radosgw_source_port}
                  params: check
    
  2. Remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

    parameters:
      reclass:
        storage:
          node:
            ceph_rgw_node04:
              name: ${_param:ceph_rgw_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_rgw_system_codename}
                single_address: ${_param:ceph_rgw_node04_address}
                keepalived_vip_priority: 104
    
  3. Log in to the Jenkins web UI.

  4. Open the Ceph - remove node pipeline.

  5. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the RADOS Gateway node to remove. For example, rgw04*.
    HOST_TYPE Removed since 2019.2.13 update Add rgw as the type of Ceph node that is going to be removed.
    CLEANDISK Added since 2019.2.10 update Mandatory since the 2019.2.13 maintenance update. Select to clean the data or block partitions.
  6. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Reconfigure HAProxy on the rest of RADOS Gateway nodes.
    2. Destroy the VM.
    3. Remove the Salt Minion node ID from salt-key on the Salt Master node.
  7. Remove the following lines from the cluster/infra/kvm.yml file or from the pillar based on your environment:

    parameters:
      salt:
        control:
          cluster:
            internal:
              node:
                rgw04:
                  name: ${_param:ceph_rgw_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.rgw
    
  8. Remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

    _param:
      ceph_rgw_node04_hostname: rgw04
      ceph_rgw_node04_address: 172.16.47.162
    linux:
      network:
        host:
          rgw04:
            address: ${_param:ceph_rgw_node04_address}
            names:
            - ${_param:ceph_rgw_node04_hostname}
            - ${_param:ceph_rgw_node04_hostname}.${_param:cluster_domain}
    
Remove a Ceph OSD daemon

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Note

The pipeline used in this section is a wrapper for Ceph - remove node, which simplifies common operations.

This section describes how to remove a Ceph OSD daemon from the cluster without removing the entire Ceph OSD node.

To remove a Ceph OSD daemon:

  1. Log in to the Jenkins web UI.

  2. Open the Ceph - remove osd pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST The Salt target name of the host from which the Ceph OSD daemons are going to be removed. For example, osd005*.
    OSD A comma-separated list of Ceph OSD daemons to remove while keeping the rest and the entire node as part of the cluster. Do not leave this parameter empty.
    WAIT_FOR_HEALTHY Verify that this parameter is selected as it enables the Ceph health check within the pipeline.
    CLEAN_ORPHANS Select to clean orphaned disks of Ceph OSDs that are no longer part of the cluster.
    FAST_WIPE Deselect if the entire disk needs zero filling.
  4. Click Deploy.

Replace a failed Ceph OSD

This section instructs you on how to replace one or multiple failed Ceph OSDs running on a physical node using the Ceph - replace failed OSD Jenkins pipeline.

To replace one or multiple failed Ceph OSDs on a physical node:

  1. Log in to the Jenkins web UI.

  2. Open the Ceph - replace failed OSD pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    HOST Add the Salt target name of the Ceph OSD node. For example, osd005*.
    OSD Add a comma-separated list of Ceph OSDs on the specified HOST node. For example 1,2.
    DEVICE [0] Add a comma-separated list of failed devices to replace at HOST. For example, /dev/sdb,/dev/sdc.
    DATA_PARTITION [0] (Optional) Add a comma-separated list of mounted partitions of the failed device to unmount. Recommended when multiple Ceph OSDs reside on one device. For example, /dev/sdb1,/dev/sdb3.
    JOURNAL_BLOCKDB_BLOCKWAL_PARTITION [0] Add a comma-separated list of partitions that store the journal, block_db, or block_wal of the failed devices on the specified HOST. For example, /dev/sdh2,/dev/sdh3.
    ADMIN_HOST Add cmn01* as the Ceph cluster node with the admin keyring.
    CLUSTER_FLAGS Add a comma-separated list of flags to apply before and after the pipeline.
    WAIT_FOR_HEALTHY Select to perform the Ceph health check within the pipeline.
    DMCRYPT [0] Select if you are replacing an encrypted OSD. In that case, also specify noout,norebalance in CLUSTER_FLAGS.
  4. Click Deploy.

The Ceph - replace failed OSD pipeline workflow:

  1. Mark the Ceph OSD as out.
  2. Wait until the Ceph cluster is in a healthy state if WAIT_FOR_HEALTHY was selected. In this case, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.
  3. Stop the Ceph OSD service.
  4. Remove the Ceph OSD from the CRUSH map.
  5. Remove the Ceph OSD authentication key.
  6. Remove the Ceph OSD from the Ceph cluster.
  7. Unmount data partition(s) of the failed disk.
  8. Delete the partition table of the failed disk.
  9. Remove the partition from the block_db, block_wal, or journal.
  10. Perform one of the following depending on the MCP release version:
    • For deployments prior to the MCP 2019.2.3 update, redeploy the failed Ceph OSD.
    • For deployments starting from the MCP 2019.2.3 update:
      1. Wait for the hardware replacement and confirmation to proceed.
      2. Redeploy the failed Ceph OSD on the replaced hardware.

Note

If any of the steps 1 - 9 has already been performed manually, Jenkins proceeds to the next step.

[0] (applies to DEVICE, DATA_PARTITION, JOURNAL_BLOCKDB_BLOCKWAL_PARTITION, and DMCRYPT) The parameter has been removed starting from the MCP 2019.2.3 update.
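
If you prefer to perform some of the steps 1 - 9 manually before running the pipeline, they are standard Ceph operations. A minimal sketch for steps 1 and 3, assuming a hypothetical OSD ID 1; run the ceph command from a node with the admin keyring and the systemctl command on the Ceph OSD node itself:

    # Step 1: mark the OSD as out (from a node with the admin keyring)
    ceph osd out osd.1
    # Step 3: stop the Ceph OSD service (on the Ceph OSD node)
    systemctl stop ceph-osd@1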

Restrict the RADOS Gateway capabilities

Note

This feature is available starting from the MCP 2019.2.10 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

To avoid a potential security vulnerability, Mirantis recommends that you restrict the RADOS Gateway capabilities of your existing MCP deployment to a bare minimum.

To restrict the RADOS Gateway capabilities of an existing MCP deployment:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. In cluster/ceph/rgw.yml, modify the RADOS Gateway capabilities as follows:

    ceph:
      common:
        keyring:
          rgw.rgw01:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
          rgw.rgw02:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
          rgw.rgw03:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
    
  3. Log in to the Salt Master node.

  4. Apply the changes:

    salt -I ceph:radosgw state.apply ceph.common,ceph.setup.keyring
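
To confirm that the reduced capabilities are in effect, you can inspect the corresponding authentication entities from a node with the admin keyring. A minimal check, assuming the entity name client.rgw.rgw01 (adjust it to the keyring names defined in your cluster model):

    # The mon and osd caps in the output should match the values set above.
    ceph auth get client.rgw.rgw01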
    

Enable the Ceph Prometheus plugin

If you have deployed StackLight LMA, you can enhance Ceph monitoring by enabling the Ceph Prometheus plugin that is based on the native Prometheus exporter introduced in Ceph Luminous. In this case, the Ceph Prometheus plugin, instead of Telegraf, collects Ceph metrics providing a wider set of graphs in the Grafana web UI, such as an overview of the Ceph cluster, hosts, OSDs, pools, RADOS gateway nodes, as well as detailed graphs on the Ceph OSD and RADOS Gateway nodes. You can enable the Ceph Prometheus plugin manually on an existing MCP cluster as described below or during the upgrade of StackLight LMA as described in Upgrade StackLight LMA using the Jenkins job.

To enable the Ceph Prometheus plugin manually:

  1. Update the Ceph formula package.

  2. Open your project Git repository with Reclass model on the cluster level.

  3. In classes/cluster/cluster_name/ceph/mon.yml, remove the service.ceph.monitoring.cluster_stats class.

  4. In classes/cluster/cluster_name/ceph/osd.yml, remove the service.ceph.monitoring.node_stats class.

  5. Log in to the Salt Master node.

  6. Refresh grains to set the new alerts and graphs:

    salt '*' state.sls salt.minion.grains
    
  7. Enable the Prometheus plugin:

    salt -C 'I@ceph:mon' state.sls ceph.mgr
    
  8. Update the targets and alerts in Prometheus:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus
    
  9. Update the new Grafana dashboards:

    salt -C 'I@grafana:client' state.sls grafana
    
  10. (Optional) Enable the StackLight LMA prediction alerts for Ceph.

    Note

    This feature is available as technical preview. Use such configuration for testing and evaluation purposes only.

    Warning

    This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

    1. Open your project Git repository with Reclass model on the cluster level.

    2. In classes/cluster/cluster_name/ceph/common.yml, set enable_prediction to True:

      parameters:
        ceph:
          common:
            enable_prediction: True
      
    3. Log in to the Salt Master node.

    4. Refresh grains to set the new alerts and graphs:

      salt '*' state.sls salt.minion.grains
      
    5. Verify and update the alerts thresholds based on the cluster hardware.

      Note

      For details about tuning the thresholds, contact Mirantis support.

    6. Update the targets and alerts in Prometheus:

      salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus
      
  11. Customize Ceph prediction alerts as described in Ceph.
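
Once the steps above are complete, you can check that the ceph-mgr Prometheus module is active and serving metrics. A minimal check from the Salt Master node, assuming the default exporter port 9283 and an exporter reachable on the local address of the Ceph Manager nodes:

    # Verify that the prometheus module is enabled in ceph-mgr
    salt -C 'I@ceph:mgr' cmd.run 'ceph mgr module ls | grep prometheus'
    # Fetch a few lines of metrics from the exporter (9283 is the default port)
    salt -C 'I@ceph:mgr' cmd.run 'curl -s http://localhost:9283/metrics | head -5'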

Enable Ceph compression

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

RADOS Gateway supports server-side compression of uploaded objects using the Ceph compression plugins. You can manually enable Ceph compression to rationalize the capacity usage on the MCP cluster.

To enable Ceph compression:

  1. Log in to any rgw node.

  2. Run the radosgw-admin zone placement modify command with the --compression=<type> option specifying the compression plugin type and other options as required. The available compression plugins to use when writing new object data are zlib, snappy, and zstd. For example:

    radosgw-admin zone placement modify \
      --rgw-zone default \
      --placement-id default-placement \
      --storage-class STANDARD \
      --compression zlib
    

    Note

    If you have not previously performed any Multi-site configuration, you can use the default values for the options except compression. To disable compression, set the compression type to an empty string or none.
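
To verify the result, you can inspect the zone configuration on the rgw node; the configured compression type appears in the corresponding placement target. A minimal check, assuming the default zone:

    # The placement_pools section of the output shows the compression type.
    radosgw-admin zone get --rgw-zone default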

See also

Ceph compression

Enable the ceph-volume tool

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Warning

  • Prior to the 2019.2.10 maintenance update, ceph-volume is available as technical preview only.
  • Starting from the 2019.2.10 maintenance update, the ceph-volume tool is fully supported and must be enabled prior to upgrading from Ceph Luminous to Nautilus. The ceph-disk tool is deprecated.

This section describes how to enable the ceph-volume command-line tool that enables you to deploy and inspect Ceph OSDs using the Logical Volume Management (LVM) functionality for provisioning block devices. The main difference between ceph-disk and ceph-volume is that ceph-volume does not automatically partition disks used for block.db. However, partitioning is performed within the procedure below.

To enable the ceph-volume tool:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Open ceph/osd.yml for editing.

  3. Set the lvm_enabled parameter to True:

    parameters:
      ceph:
        osd:
          lvm_enabled: True
    
  4. For each Ceph OSD, add the definition of a bare block.db partition and define its number in db_partition. For example:

    parameters:
      ceph:
        osd:
          backend:
            bluestore:
              disks:
                - dev: /dev/vdc
                  block_db: /dev/vdd
                  db_partition: 1
      linux:
        storage:
          disk:
            /dev/vdd:
              type: gpt
              partitions:
                - size: 10000
    

    Note

    Due to a custom disk layout, these definitions may already exist and be configured differently. In this case, verify that the db_partition number is defined.

  5. Apply the changes:

    salt -C 'I@ceph:osd' saltutil.refresh_pillar
    

    Once done, all new Ceph OSDs will be deployed using ceph-volume instead of ceph-disk. The existing Ceph OSDs will cause errors during common operations and must be redeployed or defined as legacy_disks as described below.

  6. Select from the following options:

    • Redeploy Ceph OSD nodes or daemons:

      Warning

      Before redeploying a node, verify that all pools have at least three copies. Redeploy only one node at a time. Redeploying multiple nodes may cause irreversible data loss.

    • Define existing Ceph OSDs as legacy_disks by specifying the legacy_disks pillar for each Ceph OSD in ceph/osd.yml. For example:

      parameters:
        ceph:
          osd:
            legacy_disks:
              0:
                class: hdd
                weight: 0.048691
                dev: /dev/vdc
      

      Note

      Use the legacy_disks option only as a temporary solution for common management of the cluster during the transition period.
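
After a Ceph OSD has been redeployed, you can confirm that it was created through ceph-volume rather than ceph-disk. A minimal check, run directly on the Ceph OSD node:

    # LVM-based OSDs created by ceph-volume are listed here together with
    # their block and block.db devices.
    ceph-volume lvm list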

Shut down a Ceph cluster for maintenance

This section describes how to properly shut down an entire Ceph cluster for maintenance and bring it up afterward.

To shut down a Ceph cluster for maintenance:

  1. Log in to the Salt Master node.

  2. Stop the OpenStack workloads.

  3. Stop the services that are using the Ceph cluster. For example:

    • Manila workloads (if you have shares on top of Ceph mount points)
    • heat-engine (if it has the autoscaling option enabled)
    • glance-api (if it uses Ceph to store images)
    • cinder-scheduler (if it uses Ceph to store volumes)
  4. Identify the first Ceph Monitor for operations:

    CEPH_MON=$(salt -C 'I@ceph:mon' --out=txt test.ping | sort | head -1 | \
        cut -d: -f1)
    
  5. Verify that the Ceph cluster is in healthy state:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health HEALTH_OK
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e102: 6 osds: 6 up, 6 in
                    flags sortbitwise,require_jewel_osds
              pgmap v41138: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    798 MB used, 60575 MB / 61373 MB avail
                         384 active+clean
    
  6. Set the following flags to disable rebalancing and restructuring and to pause the Ceph cluster:

    salt "${CEPH_MON}" cmd.run 'ceph osd set noout'
    salt "${CEPH_MON}" cmd.run 'ceph osd set nobackfill'
    salt "${CEPH_MON}" cmd.run 'ceph osd set norecover'
    salt "${CEPH_MON}" cmd.run 'ceph osd set norebalance'
    salt "${CEPH_MON}" cmd.run 'ceph osd set nodown'
    salt "${CEPH_MON}" cmd.run 'ceph osd set pause'
    
  7. Verify that the flags are set:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health **HEALTH_WARN**
                    **pauserd**,**pausewr**,**nodown**,**noout**,**nobackfill**,**norebalance**,**norecover** flag(s) set
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e108: 6 osds: 6 up, 6 in
                    flags **pauserd**,**pausewr**,**nodown**,**noout**,**nobackfill**,**norebalance**,**norecover**,sortbitwise,require_jewel_osds
              pgmap v41152: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    799 MB used, 60574 MB / 61373 MB avail
                         384 active+clean
    
  8. Shut down the Ceph cluster.

    Warning

    Shut down the nodes one by one in the following order:

    1. Service nodes (for example, RADOS Gateway nodes)
    2. Ceph OSD nodes
    3. Ceph Monitor nodes

Once done, perform the maintenance as required.


To start a Ceph cluster after maintenance:

  1. Log in to the Salt Master node.

  2. Start the Ceph cluster nodes.

    Warning

    Start the Ceph nodes one by one in the following order:

    1. Ceph Monitor nodes
    2. Ceph OSD nodes
    3. Service nodes (for example, RADOS Gateway nodes)
  3. Verify that the Salt minions are up:

    salt -C "I@ceph:common" test.ping
    
  4. Verify that the date is the same for all Ceph clients:

    salt -C "I@ceph:common" cmd.run date
    
  5. Identify the first Ceph Monitor for operations:

    CEPH_MON=$(salt -C 'I@ceph:mon' --out=txt test.ping | sort | head -1 | \
        cut -d: -f1)
    
  6. Unset the following flags to resume the Ceph cluster:

    salt "${CEPH_MON}" cmd.run 'ceph osd unset pause'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset nodown'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset norebalance'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset norecover'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset nobackfill'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset noout'
    
  7. Verify that the Ceph cluster is in healthy state:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health HEALTH_OK
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e102: 6 osds: 6 up, 6 in
                    flags sortbitwise,require_jewel_osds
              pgmap v41138: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    798 MB used, 60575 MB / 61373 MB avail
                         384 active+clean
    

Back up and restore Ceph

This section describes how to back up and restore Ceph OSD nodes metadata and Ceph Monitor nodes.

Note

This documentation does not provide instructions on how to back up the data stored in Ceph.

Create a backup schedule for Ceph nodes

This section describes how to manually create a backup schedule for Ceph OSD nodes metadata and for Ceph Monitor nodes.

By default, the backup functionality is enabled automatically for new MCP OpenStack with Ceph deployments in the cluster models generated using Model Designer. Use this procedure only in the case of a manual deployment or if you want to change the default backup configuration.

Note

The procedure below does not cover the backup of the Ceph OSD node data.

To create a backup schedule for Ceph nodes:

  1. Log in to the Salt Master node.

  2. Decide on which node you want to store the backups.

  3. Obtain the <STORAGE_ADDRESS> of the node selected in step 2:

    salt NODE_NAME grains.get fqdn_ip4
    
  4. Configure the ceph backup server role by adding the cluster.deployment_name.infra.backup.server class to the definition of the target storage node from step 2:

    classes:
    - cluster.deployment_name.infra.backup.server
    parameters:
      _param:
        ceph_backup_public_key: <generate_your_keypair>
    

    By default, adding this include statement results in Ceph keeping five complete backups. To change the default setting, add the following pillar to the cluster/infra/backup/server.yml file:

    parameters:
      ceph:
        backup:
          server:
            enabled: true
            hours_before_full: 24
            full_backups_to_keep: 5
    
  5. To back up the Ceph Monitor nodes, configure the ceph backup client role by adding the following lines to the cluster/ceph/mon.yml file:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2

    classes:
    - system.ceph.backup.client.single
    parameters:
      _param:
        ceph_remote_backup_server: <STORAGE_ADDRESS>
        root_private_key: |
          <generate_your_keypair>
    
  6. To back up the Ceph OSD nodes metadata, configure the ceph backup client role by adding the following lines to the cluster/ceph/osd.yml file:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2

    classes:
    - system.ceph.backup.client.single
    parameters:
      _param:
        ceph_remote_backup_server: <STORAGE_ADDRESS>
        root_private_key: |
          <generate_your_keypair>
    

    By default, adding the above include statement results in Ceph keeping three complete backups on the client node. To change the default setting, add the following pillar to the cluster/ceph/mon.yml or cluster/ceph/osd.yml files:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2

    parameters:
      ceph:
        backup:
          client:
            enabled: true
            full_backups_to_keep: 3
            hours_before_full: 24
            target:
              host: <STORAGE_ADDRESS>
    
  7. Refresh Salt pillars:

    salt -C '*' saltutil.refresh_pillar
    
  8. Apply the salt.minion state:

    salt -C 'I@ceph:backup:client or I@ceph:backup:server' state.sls salt.minion
    
  9. Refresh grains for the ceph client node:

    salt -C 'I@ceph:backup:client' saltutil.sync_grains
    
  10. Update the mine for the ceph client node:

    salt -C 'I@ceph:backup:client' mine.flush
    salt -C 'I@ceph:backup:client' mine.update
    
  11. Apply the following state on the ceph client node:

    salt -C 'I@ceph:backup:client' state.sls openssh.client,ceph.backup
    
  12. Apply the linux.system.cron state on the ceph server node:

    salt -C 'I@ceph:backup:server' state.sls linux.system.cron
    
  13. Apply the ceph.backup state on the ceph server node:

    salt -C 'I@ceph:backup:server' state.sls ceph.backup
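
As a quick sanity check that the backup client role has been applied, you can verify that the backup runner script exists on the client nodes. A minimal check from the Salt Master node:

    # The script is used later to trigger an instant backup.
    salt -C 'I@ceph:backup:client' cmd.run 'ls -l /usr/local/bin/ceph-backup-runner-call.sh'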
    
Create an instant backup of a Ceph OSD node metadata or a Ceph Monitor node

After you create a backup schedule as described in Create a backup schedule for Ceph nodes, you may also need to create an instant backup of a Ceph OSD node metadata or a Ceph Monitor node.

Note

The procedure below does not cover the backup of the Ceph OSD node data.

To create an instant backup of a Ceph node:

  1. Verify that you have completed the steps described in Create a backup schedule for Ceph nodes.

  2. Log in to a Ceph node. For example, cmn01.

  3. Run the following script:

    /usr/local/bin/ceph-backup-runner-call.sh
    
  4. Verify that a complete backup was created locally:

    ls /var/backups/ceph/full
    
  5. Verify that the complete backup was rsynced to the ceph backup server node from the Salt Master node:

    salt -C 'I@ceph:backup:server' cmd.run 'ls /srv/volumes/backup/ceph/full'
    
Restore a Ceph Monitor node

You may need to restore a Ceph Monitor node after a failure. For example, if the data in the Ceph-related directories disappeared.

To restore a Ceph Monitor node:

  1. Verify that the Ceph Monitor instance is up and running and connected to the Salt Master node.

  2. Log in to the Ceph Monitor node.

  3. Synchronize Salt modules and refresh Salt pillars:

    salt-call saltutil.sync_all
    salt-call saltutil.refresh_pillar
    
  4. Run the following Salt states:

    salt-call state.sls linux,openssh,salt,ntp,rsyslog
    
  5. Manually install Ceph packages:

    apt install ceph-mon -y
    
  6. Remove the following files from Ceph:

    rm -rf /etc/ceph/* /var/lib/ceph/*
    
  7. From the Ceph backup, copy the files from /etc/ceph/ and /var/lib/ceph to their original directories:

    cp -r /<etc_ceph_backup_path>/* /etc/ceph/
    cp -r /<var_lib_ceph_backup_path>/* /var/lib/ceph/
    
  8. Change the files ownership:

    chown -R ceph:ceph /var/lib/ceph/*
    
  9. Run the following Salt state:

    salt-call state.sls ceph
    

    If the output contains an error, rerun the state.
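
Once the state completes without errors, verify that the restored Ceph Monitor has rejoined the quorum. A minimal check, run directly on the restored node:

    # Both commands should list the restored monitor as part of the quorum.
    ceph mon stat
    ceph -s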

Restore the metadata of a Ceph OSD node

You may need to restore the metadata of a Ceph OSD node after a failure. For example, if the primary disk fails or the data in the Ceph-related directories, such as /var/lib/ceph/, on the OSD node disappeared.

To restore the metadata of a Ceph OSD node:

  1. Verify that the Ceph OSD node is up and running and connected to the Salt Master node.

  2. Log in to the Ceph OSD node.

  3. Synchronize Salt modules and refresh Salt pillars:

    salt-call saltutil.sync_all
    salt-call saltutil.refresh_pillar
    
  4. Run the following Salt states:

    salt-call state.sls linux,openssh,salt,ntp,rsyslog
    
  5. Manually install Ceph packages:

    apt install ceph-osd -y
    
  6. Stop all ceph-osd services:

    systemctl stop ceph-osd@<num>
    
  7. Remove the following files from Ceph:

    rm -rf /etc/ceph/* /var/lib/ceph/*
    
  8. From the Ceph backup, copy the files from /etc/ceph/ and /var/lib/ceph to their original directories:

    cp -r /<etc_ceph_backup_path>/* /etc/ceph/
    cp -r /<var_lib_ceph_backup_path>/* /var/lib/ceph/
    
  9. Change the files ownership:

    chown -R ceph:ceph /var/lib/ceph/*
    
  10. Restart the services for all Ceph OSDs:

    systemctl restart ceph-osd@<osd_num>
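
After the services restart, confirm that the restored Ceph OSDs report as up and in. A minimal check, assuming the commands are run on a node with the admin keyring (for example, cmn01):

    # All OSDs of the restored node should appear as "up" and "in".
    ceph osd tree
    ceph -s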
    

Migrate the Ceph back end

Ceph uses FileStore or BlueStore as a storage back end. You can migrate the Ceph storage back end from FileStore to BlueStore and vice versa using the Ceph - backend migration pipeline.

Note

Starting from the 2019.2.10 maintenance update, this procedure is deprecated and all Ceph OSDs should use LVM with BlueStore. Back-end migration is described in Enable the ceph-volume tool.

For earlier versions, if you are going to upgrade Ceph to Nautilus, also skip this procedure to avoid a double migration of the back end. In this case, first apply the 2019.2.10 maintenance update and then enable ceph-volume as well.

To migrate the Ceph back end:

  1. In your project repository, open the cluster/ceph/osd.yml file for editing:

    1. Change the back end type and block_db or journal for every OSD disk device.
    2. Specify the size of the journal or block_db device if it resides on a different device than the storage device. The device's storage will be divided equally among the OSDs that use it.

    Example:

    parameters:
      ceph:
        osd:
          bluestore_block_db_size: 10073741824
    #      journal_size: 10000
          backend:
    #        filestore:
            bluestore:
              disks:
              - dev: /dev/sdh
                block_db: /dev/sdj
    #            journal: /dev/sdj
    

    In this example, the commented lines show the FileStore settings that must be replaced and removed when migrating from FileStore to BlueStore.

  2. Log in to the Jenkins web UI.

  3. Open the Ceph - backend migration pipeline.

  4. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    ADMIN_HOST Add cmn01* as the Ceph cluster node with the admin keyring.
    TARGET Add the Salt target name of the Ceph OSD node(s). For example, osd005* to migrate on one OSD HOST or osd* to migrate on all OSD hosts.
    OSD Add * to target all Ceph OSD disks on all TARGET OSD hosts, or a comma-separated list of Ceph OSDs if TARGET matches a single OSD host. For example, 1,2.
    WAIT_FOR_HEALTHY Verify that this parameter is selected as it enables the Ceph health check within the pipeline.
    PER_OSD_CONTROL Select to verify the Ceph status after migration of each OSD disk.
    PER_OSD_HOST_CONTROL Select to verify the Ceph status after the whole OSD host migration.
    CLUSTER_FLAGS Add a comma-separated list of flags to apply for the migration procedure. The pipeline has been tested with this parameter left blank.
    ORIGIN_BACKEND Specify the Ceph back end before migration.

    Note

    The PER_OSD_CONTROL and PER_OSD_HOST_CONTROL options provide granular control during the migration to verify each OSD disk after its migration. You can decide to continue or abort.

  5. Click Deploy.

The Ceph - backend migration pipeline workflow:

  1. Set back-end migration flags.
  2. Perform the following for each targeted OSD disk:
    1. Mark the Ceph OSD as out.
    2. Stop the Ceph OSD service.
    3. Remove the Ceph OSD authentication key.
    4. Remove the Ceph OSD from the Ceph cluster.
    5. Remove block_db, block_wal, or journal of the OSD.
  3. Run the ceph.osd state to deploy the OSD with a desired back end.
  4. Unset the back-end migration flags.

Note

During the pipeline execution, a check is performed to verify whether the back end type for an OSD disk differs from the one specified in ORIGIN_BACKEND. If the back end differs, Jenkins does not apply any changes to that OSD disk.
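
To check which back end a particular Ceph OSD currently uses before or after the migration, you can query its metadata. A minimal example for a hypothetical OSD ID 1, run on a node with the admin keyring:

    # The osd_objectstore field reports either "filestore" or "bluestore".
    ceph osd metadata 1 | grep osd_objectstore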

Migrate the management of a Ceph cluster

You can migrate the management of an existing Ceph cluster deployed by Decapod to a cluster managed by the Ceph Salt formula.

To migrate the management of a Ceph cluster:

  1. Log in to the Decapod web UI.

  2. Navigate to the CONFIGURATIONS tab.

  3. Select the required configuration and click VIEW.

  4. Generate a new cluster model with Ceph as described in MCP Deployment Guide: Create a deployment metadata model using the Model Designer. Verify that you fill in the correct values from the Decapod configuration file displayed in the VIEW tab of the Decapod web UI.

  5. In the <cluster_name>/ceph/setup.yml file, specify the right pools and parameters for the existing pools.

    Note

    Verify that the keyring names and their caps match the ones that already exist in the Ceph cluster deployed by Decapod.

  6. In the <cluster_name>/infra/config.yml file, add the following pillar and modify the parameters according to your environment:

    ceph:
      decapod:
        ip: 192.168.1.10
        user: user
        pass: psswd
        deploy_config_name: ceph
    
  7. On the node defined in the previous step, apply the following state:

    salt-call state.sls ceph.migration
    

    Note

    The output of this state must contain defined configurations, Ceph OSD disks, Ceph File System ID (FSID), and so on.

  8. Using the output of the previous command, add the following pillars to your cluster model:

    1. Add the ceph:common pillar to <cluster_name>/ceph/common.yml.
    2. Add the ceph:osd pillar to <cluster_name>/ceph/osd.yml.
  9. Examine the newly generated cluster model for any occurrence of the ceph keyword and verify that it exists in your current cluster model.

  10. Examine each Ceph cluster file to verify that the parameters match the configuration specified in Decapod.

  11. Copy the Ceph cluster directory to the existing cluster model.

  12. Verify that the ceph subdirectory is included in your cluster model in <cluster_name>/infra/init.yml or <cluster_name>/init.yml for older cluster model versions:

    classes:
    - cluster.<cluster_name>.ceph
    
  13. Add the Reclass storage nodes to <cluster_name>/infra/config.yml and change the count variable to the number of OSDs you have. For example:

    classes:
    - system.reclass.storage.system.ceph_mon_cluster
    - system.reclass.storage.system.ceph_rgw_cluster # Add this line only if the RADOS
    # Gateway services run on nodes separate from the Ceph Monitor services.
    parameters:
      reclass:
        storage:
          node:
            ceph_osd_rack01:
              name: ${_param:ceph_osd_rack01_hostname}<<count>>
              domain: ${_param:cluster_domain}
              classes:
                - cluster.${_param:cluster_name}.ceph.osd
              repeat:
                count: 3
                start: 1
                digits: 3
                params:
                  single_address:
                    value: ${_param:ceph_osd_rack01_single_subnet}.<<count>>
                    start: 201
                  backend_address:
                    value: ${_param:ceph_osd_rack01_backend_subnet}.<<count>>
                    start: 201
    
  14. If the Ceph RADOS Gateway service is running on the same nodes as the Ceph monitor services:

    1. Add the following snippet to <cluster_name>/infra/config.yml:

      reclass:
        storage:
          node:
            ceph_mon_node01:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
            ceph_mon_node02:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
            ceph_mon_node03:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
      
    2. Verify that the parameters in <cluster_name>/ceph/rgw.yml are defined correctly according to the existing Ceph cluster.

  15. From the Salt Master node, generate the Ceph nodes:

    salt-call state.sls reclass
    
  16. Run the commands below.

    Warning

    If the outputs of the commands below contain any changes that can potentially break the cluster, change the cluster model as needed and optionally run the salt-call pillar.data ceph command to verify that the Salt pillar contains the correct value. Proceed to the next step only once you are sure that your model is correct.

    • From the Ceph monitor nodes:

      salt-call state.sls ceph test=True
      
    • From the Ceph OSD nodes:

      salt-call state.sls ceph test=True
      
    • From the Ceph RADOS Gateway nodes:

      salt-call state.sls ceph test=True
      
    • From the Salt Master node:

      salt -C 'I@ceph:common' state.sls ceph test=True
      
  17. Once you have verified that no changes by the Salt Formula can break the running Ceph cluster, run the following commands.

    • From the Salt Master node:

      salt -C 'I@ceph:common:keyring:admin' state.sls ceph.mon
      salt -C 'I@ceph:mon' saltutil.sync_grains
      salt -C 'I@ceph:mon' mine.update
      salt -C 'I@ceph:mon' state.sls ceph.mon
      
    • From one of the OSD nodes:

      salt-call state.sls ceph.osd
      

      Note

      Before you proceed, verify that the OSDs on this node are working fine.

    • From the Salt Master node:

      salt -C 'I@ceph:osd' state.sls ceph.osd
      
    • From the Salt Master node:

      salt -C 'I@ceph:radosgw' state.sls ceph.radosgw
      

Enable RBD monitoring

Warning

This feature is available as technical preview starting from the MCP 2019.2.10 maintenance update and requires Ceph Nautilus. Use such configuration for testing and evaluation purposes only. Before using the feature, follow the steps described in Apply maintenance updates.

If required, you can enable RADOS Block Device (RBD) images monitoring introduced with Ceph Nautilus. Once done, you can view RBD metrics using the Ceph RBD Overview Grafana dashboard. For details, see Ceph dashboards.

To enable RBD monitoring:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/ceph/setup.yml, add the rbd_stats flag for the pools that serve RBD images to enable serving of RBD metrics:

    parameters:
      ceph:
        setup:
          pool:
            <pool_name>:
              pg_num: 8
              pgp_num: 8
              type: replicated
              application: rbd
              rbd_stats: True
    
  3. In classes/cluster/<cluster_name>/ceph/common.yml, set the rbd_monitoring_enabled parameter to True to enable the Ceph RBD Overview Grafana dashboard:

    ceph:
      common:
        public_network: 10.13.0.0/16
        cluster_network: 10.12.0.0/16
        rbd_monitoring_enabled: True
    
  4. Log in to the Salt Master node.

  5. Apply the changes:

    salt "*" saltutil.refresh_pillar
    salt "*" state.apply salt.minion.grains
    salt "*" saltutil.refresh_grains
    salt -C "I@ceph:mgr" state.apply 'ceph.mgr'
    salt -C 'I@grafana:client' state.apply 'grafana.client'
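
To verify that RBD metrics are now exposed, you can query the ceph-mgr Prometheus exporter. A rough check from the Salt Master node, assuming the default exporter port 9283 and RBD metric names prefixed with ceph_rbd_:

    # A non-empty output indicates that per-image RBD metrics are being served.
    salt -C 'I@ceph:mgr' cmd.run 'curl -s http://localhost:9283/metrics | grep ceph_rbd_ | head -5'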
    

Enable granular distribution of Ceph keys

Note

This feature is available starting from the MCP 2019.2.14 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable granular distribution of Ceph keys on an existing deployment to avoid keeping the Ceph keys for the services that do not belong to a particular node.

To enable granular distribution of Ceph keys:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Create a new ceph/keyrings folder.

  3. Open the ceph/common.yml file for editing.

  4. Move the configuration for each component from the parameters:ceph:common:keyring section to a corresponding file in the newly created folder. For example, the following configuration must be split into four different files.

    ceph:
      common:
        keyring:
          glance:
            name: ${_param:glance_storage_user}
            caps:
              mon: 'allow r, allow command "osd blacklist"'
              osd: "profile rbd pool=images"
          cinder:
            name: ${_param:cinder_storage_user}
            caps:
              mon: 'allow r, allow command "osd blacklist"'
              osd: "profile rbd pool=volumes, profile rbd-read-only pool=images, profile rbd pool=${_param:cinder_ceph_backup_pool}"
          nova:
            name: ${_param:nova_storage_user}
            caps:
              mon: 'allow r, allow command "osd blacklist"'
              osd: "profile rbd pool=vms, profile rbd-read-only pool=images"
          gnocchi:
            name: ${_param:gnocchi_storage_user}
            caps:
              mon: 'allow r, allow command "osd blacklist"'
              osd: "profile rbd pool=${_param:gnocchi_storage_pool}"
    

    In this case, each file must have its own component keyring. For example:

    • In ceph/keyrings/nova.yml, add:

      parameters:
        ceph:
          common:
            keyring:
              nova:
                name: ${_param:nova_storage_user}
                caps:
                  mon: 'allow r, allow command "osd blacklist"'
                  osd: "profile rbd pool=vms, profile rbd-read-only pool=images"
      
    • In ceph/keyrings/cinder.yml, add:

      parameters:
        ceph:
          common:
            keyring:
              cinder:
                name: ${_param:cinder_storage_user}
                caps:
                  mon: 'allow r, allow command "osd blacklist"'
                  osd: "profile rbd pool=volumes, profile rbd-read-only pool=images, profile rbd pool=${_param:cinder_ceph_backup_pool}"
      
    • In ceph/keyrings/glance.yml, add:

      parameters:
        ceph:
          common:
            keyring:
              glance:
                name: ${_param:glance_storage_user}
                caps:
                  mon: 'allow r, allow command "osd blacklist"'
                  osd: "profile rbd pool=images"
      
    • In ceph/keyrings/gnocchi.yml, add:

      parameters:
        ceph:
          common:
            keyring:
              gnocchi:
                name: ${_param:gnocchi_storage_user}
                caps:
                  mon: 'allow r, allow command "osd blacklist"'
                  osd: "profile rbd pool=${_param:gnocchi_storage_pool}"
      
  5. In the same ceph/keyrings folder, create an init.yml file and add the newly created keyrings:

    classes:
    - cluster.<cluster_name>.ceph.keyrings.glance
    - cluster.<cluster_name>.ceph.keyrings.cinder
    - cluster.<cluster_name>.ceph.keyrings.nova
    - cluster.<cluster_name>.ceph.keyrings.gnocchi
    

    Note

    If Telemetry is disabled, Gnocchi may not be present in your deployment.

  6. In openstack/compute/init.yml, add the Cinder and Nova keyrings after class cluster.<cluster_name>.ceph.common:

    - cluster.<cluster_name>.ceph.keyrings.cinder
    - cluster.<cluster_name>.ceph.keyrings.nova
    
  7. In openstack/control.yml, add the following line after cluster.<cluster_name>.ceph.common:

    - cluster.<cluster_name>.ceph.keyrings
    
  8. In openstack/telemetry.yml, add the Gnocchi keyring after class cluster.<cluster_name>.ceph.common:

    - cluster.<cluster_name>.ceph.keyrings.gnocchi
    
  9. Log in to the Salt Master node.

  10. Synchronize the Salt modules and update mines:

    salt "*" saltutil.sync_all
    salt "*" mine.update
    
  11. Drop the redundant keyrings from the corresponding nodes and verify that the keyrings will not change with the new Salt run:

    Note

    If ceph:common:manage_keyring is enabled, modify the last state for each component using the following template:

    salt "<target>" state.sls ceph.common,ceph.setup.keyring,ceph.setup.managed_keyring test=true
    
    • For the OpenStack compute nodes, run:

      salt "cmp*" cmd.run "rm /etc/ceph/ceph.client.glance.keyring"
      salt "cmp*" cmd.run "rm /etc/ceph/ceph.client.gnocchi.keyring"
      salt "cmp*" state.sls ceph.common,ceph.setup.keyring test=true
      
    • For the Ceph Monitor nodes, run:

      salt "cmn*" cmd.run "rm /etc/ceph/ceph.client.glance.keyring"
      salt "cmn*" cmd.run "rm /etc/ceph/ceph.client.gnocchi.keyring"
      salt "cmn*" cmd.run "rm /etc/ceph/ceph.client.nova.keyring"
      salt "cmn*" cmd.run "rm /etc/ceph/ceph.client.cinder.keyring"
      salt "cmn*" state.sls ceph.common,ceph.setup.keyring test=true
      
    • For the RADOS Gateway nodes, run:

      salt "rgw*" cmd.run "rm /etc/ceph/ceph.client.glance.keyring"
      salt "rgw*" cmd.run "rm /etc/ceph/ceph.client.gnocchi.keyring"
      salt "rgw*" cmd.run "rm /etc/ceph/ceph.client.nova.keyring"
      salt "rgw*" cmd.run "rm /etc/ceph/ceph.client.cinder.keyring"
      salt "rgw*" state.sls ceph.common,ceph.setup.keyring test=true
      
    • For the Telemetry nodes, run:

      salt "mdb*" cmd.run "rm /etc/ceph/ceph.client.glance.keyring"
      salt "mdb*" cmd.run "rm /etc/ceph/ceph.client.nova.keyring"
      salt "mdb*" cmd.run "rm /etc/ceph/ceph.client.cinder.keyring"
      salt "mdb*" state.sls ceph.common,ceph.setup.keyring test=true
      
  12. Apply the changes for all components one by one:

    • If ceph:common:manage_keyring is disabled:

      salt "<target>" state.sls ceph.common,ceph.setup.keyring
      
    • If ceph:common:manage_keyring is enabled:

      salt "<target>" state.sls ceph.common,ceph.setup.keyring,ceph.setup.managed_keyring
      

Glance operations

This section describes the OpenStack Image service (Glance) operations you may need to perform after the deployment of an MCP cluster.

Enable uploading of an image through Horizon with self-managed SSL certificates

By default, the OpenStack Dashboard (Horizon) supports direct uploading of images to Glance. However, if an MCP cluster is deployed using self-signed certificates for the public API endpoints and Horizon, uploading images to Glance through the Horizon web UI may fail. When you access the Horizon web UI of such an MCP deployment for the first time, a warning informs you that the site is insecure and you must explicitly trust its certificate. However, when you try to upload an image directly from the web browser, the certificate of the Glance API is still not trusted by the browser because the host:port of that endpoint is different. In this case, you must explicitly trust the certificate of the Glance API as well.

To enable uploading of an image through Horizon with self-managed SSL certificates:

  1. Navigate to the Horizon web UI.

  2. On the page that opens, configure your web browser to trust the Horizon certificate if you have not done so yet:

    • In Google Chrome or Chromium, click Advanced > Proceed to <URL> (unsafe).
    • In Mozilla Firefox, navigate to Advanced > Add Exception, enter the URL in the Location field, and click Confirm Security Exception.

    Note

    For other web browsers, the steps may vary slightly.

  3. Navigate to Project > API Access.

  4. Copy the Service Endpoint URL of the Image service.

  5. Open this URL in a new window or tab of the same web browser.

  6. Configure your web browser to trust the certificate of this site as described in the step 2.

    As a result, the version discovery document should appear with contents depending on the OpenStack version. For example, for OpenStack Ocata:

    {"versions": [{"status": "CURRENT", "id": "v2.5", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.4", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.3", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.2", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.1", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.0", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}]}
    

Once done, you should be able to upload an image through Horizon with self-managed SSL certificates.
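
If you prefer to check the Image service endpoint from the command line, you can fetch the same version discovery document with curl. A minimal example, assuming a self-signed certificate (hence the -k option) and the endpoint URL copied in step 4:

    # Substitute the URL with the Service Endpoint of the Image service.
    curl -k https://<glance_api_host>:9292/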

Telemetry operations

This section describes the Tenant Telemetry service (Ceilometer) operations you may need to perform after the deployment of an MCP cluster.

Enable the Gnocchi archive policies in Tenant Telemetry

The Gnocchi archive policies allow you to define the aggregation and storage policies for metrics received from Ceilometer.

Each archive policy definition is set as the number of points over a timespan. The default archive policy contains two definitions and one rule. It allows you to store metrics for seven days with a granularity of one minute and for 365 days with a granularity of one hour. It is applied to any metrics sent to Gnocchi with the metric pattern *. You can customize all parameters on the cluster level of your Reclass model.

To enable the Gnocchi archive policies:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In /openstack/telemetry.yml, verify that the following class is present:

    classes:
    ...
    - system.ceilometer.server.backend.gnocchi
    
  3. In /openstack/control/init.yml, add the following classes:

    classes:
    ...
    - system.gnocchi.client
    - system.gnocchi.client.v1.archive_policy.default
    

    The parameters of system.gnocchi.client.v1.archive_policy.default are as follows:

    parameters:
      _param:
        gnocchi_default_policy_granularity_1: '0:01:00'
        gnocchi_default_policy_points_1: 10080
        gnocchi_default_policy_timespan_1: '7 days'
        gnocchi_default_policy_granularity_2: '1:00:00'
        gnocchi_default_policy_points_2: 8760
        gnocchi_default_policy_timespan_2: '365 days'
        gnocchi_default_policy_rule_metric_pattern: '"*"'
      gnocchi:
        client:
          resources:
            v1:
              enabled: true
              cloud_name: 'admin_identity'
              archive_policies:
                default:
                  definition:
                    - granularity: "${_param:gnocchi_default_policy_granularity_1}"
                      points: "${_param:gnocchi_default_policy_points_1}"
                      timespan: "${_param:gnocchi_default_policy_timespan_1}"
                    - granularity: "${_param:gnocchi_default_policy_granularity_2}"
                      points: "${_param:gnocchi_default_policy_points_2}"
                      timespan: "${_param:gnocchi_default_policy_timespan_2}"
                  rules:
                    default:
                      metric_pattern: "${_param:gnocchi_default_policy_rule_metric_pattern}"
    
  4. Optional. Specify additional archive policies as required. For example, to aggregate the CPU and disk-related metrics with a timespan of 30 days and a granularity of one second, add the following parameters to /openstack/control/init.yml under the default Gnocchi archive policy parameters:

    parameters:
      _param:
      ...
      gnocchi:
        client:
          resources:
            v1:
              enabled: true
              cloud_name: 'admin_identity'
              archive_policies:
                default:
                ...
                cpu_disk_policy:
                  definition:
                    - granularity: '0:00:01'
                      points: 2592000
                      timespan: '30 days'
                  rules:
                    cpu_rule:
                      metric_pattern: 'cpu*'
                    disk_rule:
                      metric_pattern: 'disk*'
    

    Caution

    Rule names defined across archive policies must be unique.

  5. Log in to the Salt Master node.

  6. Apply the following states:

    salt -C 'I@gnocchi:client and *01*' saltutil.refresh_pillar
    salt -C 'I@gnocchi:client and *01*' state.sls gnocchi.client
    salt -C 'I@gnocchi:client' state.sls gnocchi.client
    
  7. Verify that the archive policies are set successfully:

    1. Log in to any OpenStack controller node.

    2. Boot a test VM:

      source keystonercv3
      openstack server create --flavor <flavor_id> \
      --nic net-id=<net_id> --image <image_id>  test_vm1
      
    3. Run the following command:

      openstack metric list | grep <vm_id>
      

      Use the vm_id parameter value from the output of the command that you run in the previous step.

      Example of system response extract:

      +---------+-------------------+-------------------------------+------+-----------+
      | id      |archive_policy/name| name                          | unit |resource_id|
      +---------+-------------------+-------------------------------+------+-----------+
      | 0ace... | cpu_disk_policy   | disk.allocation               | B    | d9011...  |
      | 0ca6... | default           | perf.instructions             | None | d9011...  |
      | 0fcb... | default           | compute.instance.booting.time | sec  | d9011...  |
      | 10f0... | cpu_disk_policy   | cpu_l3_cache                  | None | d9011...  |
      | 2392... | default           | memory                        | MB   | d9011...  |
      | 2395... | cpu_disk_policy   | cpu_util                      | %    | d9011...  |
      | 26a0... | default           | perf.cache.references         | None | d9011...  |
      | 367e... | cpu_disk_policy   | disk.read.bytes.rate          | B/s  | d9011...  |
      | 3857... | default           | memory.bandwidth.total        | None | d9011...  |
      | 3bb2... | default           | memory.usage                  | None | d9011...  |
      | 4288... | cpu_disk_policy   | cpu                           | ns   | d9011...  |
      +---------+-------------------+-------------------------------+------+-----------+
      

    In the example output above, all metrics are aggregated using the default archive policy except for the CPU and disk metrics aggregated by cpu_disk_policy. The cpu_disk_policy parameters were previously customized in the Reclass model.

Add availability zone to Gnocchi instance resource

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to add availability zones to Gnocchi instance resources and consume the instance.create.end events.

To add an availability zone to a Gnocchi instance resource:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In /openstack/telemetry.yml, set the create_resources parameter to True:

    ceilometer:
      server:
        publisher:
          gnocchi:
            enabled: True
            create_resources: True
    
  3. From the Salt Master node, apply the following state:

    salt -C 'I@ceilometer:server' saltutil.refresh_pillar
    salt -C 'I@ceilometer:server' state.apply ceilometer.server
    

Migrate from GlusterFS to rsync for fernet and credential keys rotation

By default, the latest MCP deployments use rsync for fernet and credential keys rotation. However, if your MCP version is 2018.8.0 or earlier, GlusterFS is used as the default rotation driver for both fernet and credential keys. This section provides an instruction on how to configure your MCP OpenStack deployment to use rsync with SSH instead of GlusterFS.

To migrate from GlusterFS to rsync:

  1. Log in to the Salt Master node.

  2. On the system level, verify that the following class is included in keystone/server/cluster.yml:

    - system.keystone.server.fernet_rotation.cluster
    

    Note

    The default configuration for the system.keystone.server.fernet_rotation.cluster class is defined in keystone/server/fernet_rotation/cluster.yml. It includes the default list of nodes to which fernet and credential keys are synchronized, sync_node01 and sync_node02. If more nodes must synchronize fernet and credential keys, expand this list as required.

  3. Verify that the crontab job is disabled in the keystone/client/core.yml and keystone/client/single.yml system-level files:

    linux:
      system:
        job:
          keystone_job_rotate:
            command: '/usr/bin/keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone >> /var/log/key_rotation_log 2>> /var/log/key_rotation_log'
            enabled: false
            user: root
            minute: 0
    
  4. Apply the Salt orchestration state to configure all required prerequisites, such as creating an SSH public key and uploading it to the Salt mine and the secondary control nodes:

    salt-run state.orchestrate keystone.orchestrate.deploy
    
  5. Apply the keystone.server state to install the Keystone rotation script and run it in the sync mode so that the fernet and credential keys are synchronized with the secondary Keystone nodes:

    salt -C 'I@keystone:server:role:primary' state.apply keystone.server
    salt -C 'I@keystone:server' state.apply keystone.server
    
  6. Apply the linux.system state to add crontab jobs for the Keystone user:

    salt -C 'I@keystone:server' state.apply linux.system
    
  7. On all OpenStack Controller nodes:

    1. Copy the current credential and fernet keys to temporary directories:

      mkdir /tmp/keystone_credential /tmp/keystone_fernet
      cp /var/lib/keystone/credential-keys/* /tmp/keystone_credential
      cp /var/lib/keystone/fernet-keys/* /tmp/keystone_fernet
      
    2. Unmount the related GlusterFS mount points:

      umount /var/lib/keystone/credential-keys
      umount /var/lib/keystone/fernet-keys
      
    3. Copy the keys from the temporary directories to /var/lib/keystone/credential-keys/ and /var/lib/keystone/fernet-keys/:

      mkdir -p /var/lib/keystone/credential-keys/ /var/lib/keystone/fernet-keys/
      cp /tmp/keystone_credential/* /var/lib/keystone/credential-keys/
      cp /tmp/keystone_fernet/* /var/lib/keystone/fernet-keys/
      chown -R keystone:keystone /var/lib/keystone/credential-keys/*
      chown -R keystone:keystone /var/lib/keystone/fernet-keys/*
      
  8. On a KVM node, stop and delete the keystone-credential-keys and keystone-keys volumes:

    1. Stop the volumes:

      gluster volume stop keystone-credential-keys
      gluster volume stop keystone-keys
      
    2. Delete the GlusterFS volumes:

      gluster volume delete keystone-credential-keys
      gluster volume delete keystone-keys
      
  9. On the cluster level model, remove the following GlusterFS classes included in the openstack/control.yml file by default:

    - system.glusterfs.server.volume.keystone
    - system.glusterfs.client.volume.keystone
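
As a quick verification that the rsync-based rotation is in place, check that the rotation job now runs under the keystone user and that the key directories are no longer GlusterFS mounts. For example, from the Salt Master node (the second command is expected to return no mount entries):

    salt -C 'I@keystone:server' cmd.run 'crontab -l -u keystone'
    salt -C 'I@keystone:server' cmd.run 'mount | grep keystone || true'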
    

Disable the Memcached listener on the UDP port

Starting from the Q4`18 MCP release, to reduce the attack surface and increase the product security, Memcached on the controller nodes listens on TCP only. The UDP port for Memcached is disabled by default. This section explains how to disable the UDP listeners for the existing OpenStack environments deployed on top of the earlier MCP versions.

To disable the Memcached listener on the UDP port:

  1. Log in to the Salt Master node.

  2. Update your Reclass metadata model.

  3. Verify the memcached:server pillar:

    salt ctl01* pillar.get memcached:server
    

    After the Reclass metadata model update, the memcached:server:bind:proto pillar should be available, with proto:udp:enabled set to False for all Memcached server instances.

    Example of system response:

    -- start output --
      ----------
      bind:
          ----------
          address:
              0.0.0.0
          port:
              11211
          proto:
              ----------
              tcp:
                  ----------
                  enabled:
                      True
              udp:
                  ----------
                  enabled:
                      False
          protocol:
              tcp
      enabled:
          True
      maxconn:
          8192
    -- end output --
    
  4. Run the memcached.server state to apply the changes to all memcached instances:

    salt -C 'I@memcached:server' state.sls memcached.server
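
To verify the result, check that Memcached no longer listens on UDP port 11211 while the TCP listener remains available. For example, from the Salt Master node:

    # The first command should report no UDP socket on port 11211.
    salt -C 'I@memcached:server' cmd.run 'ss -ulnp | grep 11211 || echo "no UDP listener"'
    salt -C 'I@memcached:server' cmd.run 'ss -tlnp | grep 11211'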
    

Configuring rate limiting with NGINX

MCP enables you to limit the number of HTTP requests that a user can make in a given period of time for your OpenStack deployments. Rate limiting with NGINX can be used to protect an OpenStack environment against DDoS attacks as well as to protect the application servers from being overwhelmed by too many user requests at the same time.

For rate limiting configuration, MCP supports the following NGINX modules:

  • ngx_http_geo_module
  • ngx_http_map_module
  • ngx_http_limit_req_module
  • ngx_http_limit_conn_module

This section provides the related NGINX directives description with the configuration samples which you can use to enable rate limiting in your MCP OpenStack deployment.

Starting from the MCP 2019.2.20 maintenance update, you can also configure request limiting for custom locations.

NGINX rate limiting configuration sample

This section includes the configuration sample of NGINX rate limiting feature that enables you to limit the number of HTTP requests a user can make in a given period of time.

In the sample, all clients except for 10.12.100.1 are limited to 1 request per second. More specifically, the sample illustrates how to:

  • Create a geo instance that will match the client IP address and set the ip_limit_key variable, where 0 stands for unlimited and 1 stands for limited.
  • Create global_geo_limiting_map that will map ip_limit_key to ip_limit_action.
  • Create a global limit_req_zone zone called global_limit_zone that limits the number of requests to 1 request per second.
  • Apply global_limit_zone globally to all requests with a burst of 5 requests and the nodelay option.

Configuration sample:

nginx:
  server:
    enabled: true
    geo:
      enabled: true
      items:
        global_geo_limiting:
          enabled: true
          variable: ip_limit_key
          body:
            default:
              value: '1'
            unlimited_client1:
              name: '10.12.100.1/32'
              value: '0'
    map:
      enabled: true
      items:
        global_geo_limiting_map:
          enabled: true
          string: ip_limit_key
          variable: ip_limit_action
          body:
            limited:
              name: 1
              value: '$binary_remote_addr'
            unlimited:
              name: 0
              value: '""'
    limit_req_module:
      limit_req_zone:
        global_limit_zone:
          key: ip_limit_action
          size: 10m
          rate: '1r/s'
      limit_req_status: 503
      limit_req:
         global_limit_zone:
           burst: 5
           enabled: true
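
To render the sample above into the NGINX configuration, apply the nginx.server state to the proxy nodes and check the resulting configuration syntax. A minimal sketch, assuming the pillar is defined for the nodes matched by I@nginx:server:

salt -C 'I@nginx:server' state.sls nginx.server
salt -C 'I@nginx:server' cmd.run 'nginx -t'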

To apply the request limiting to a particular site, define limit_req at the site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_req_module:
          limit_req:
            global_limit_zone:
              burst: 5
              nodelay: true
              enabled: true
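
After the states are applied, you can roughly validate the limit from any client host. The following sketch assumes a hypothetical Keystone endpoint published through the NGINX proxy at https://<proxy_vip>:5000/v3/; with a rate of 1 r/s and a burst of 5, the trailing requests should return the configured limit_req_status code (503 in this sample):

for i in $(seq 1 10); do
  curl -k -s -o /dev/null -w "%{http_code}\n" https://<proxy_vip>:5000/v3/
done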

Configuring the geo module

The ngx_http_geo_module module creates variables with values depending on the client IP address.

Syntax: geo [$address] $variable { ... }
Default: none
Context: http

NGINX configuration sample:
geo $my_geo_map {
    default        0;
    127.0.0.1      0;
    10.12.100.1/32 1;
    10.13.0.0/16    2;
    2001:0db8::/32 1;
}

Example of a Salt pillar for the geo module:

nginx:
  server:
    geo:
      enabled: true
      items:
        my_geo_map:
          enabled: true
          variable: my_geo_map_variable
          body:
            default:
              value: '0'
            localhost:
              name: 127.0.0.1
              value: '0'
            client:
              name: 10.12.100.1/32
              value: '1'
            network:
              name: 10.13.0.0/16
              value: '2'
            ipv6_client:
              name: 2001:0db8::/32
              value: '1'

After you apply the nginx.server state, all geo variables specified in the pillars are reflected in the /etc/nginx/conf.d/geo.conf file.
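
For a quick check, you can inspect the rendered file directly on the proxy nodes:

salt -C 'I@nginx:server' cmd.run 'cat /etc/nginx/conf.d/geo.conf'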

Configuring the mapping

The ngx_http_map_module module creates variables whose values depend on the values of the source variables specified in the first parameter.

Syntax: map string $variable { ... }
Default: none
Context: http

NGINX configuration sample:
map $my_geo_map_variable $ip_limit_action {
    default "";
    1 $binary_remote_addr;
    0 "";
}

Example of a Salt pillar for the map module:

nginx:
  server:
    map:
      enabled: true
      items:
        global_geo_limiting_map:
          enabled: true
          string: my_geo_map_variable
          variable: ip_limit_action
          body:
            default:
              value: '""'
            limited:
              name: '1'
              value: '$binary_remote_addr'
            unlimited:
              name: '0'
              value: '""'

After you apply the nginx.server state, all map variables specified in the pillars are reflected in the /etc/nginx/conf.d/map.conf file.
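
Similarly, you can inspect the rendered mapping on the proxy nodes:

salt -C 'I@nginx:server' cmd.run 'cat /etc/nginx/conf.d/map.conf'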

Configuring the request limiting

The ngx_http_limit_req_module module limits the request processing rate per defined key. The module directives include the mandatory limit_req_zone and limit_req directives and the optional limit_req_status directive.

The limit_req_zone directive defines the parameters for the rate limiting.

Syntax: limit_req_zone key zone=name:size rate=rate [sync];
Default: none
Context: http

NGINX configuration sample:
limit_req_zone $binary_remote_addr zone=global_limit_zone1:10m rate=1r/s ;
limit_req_zone $ip_limit_action zone=global_limit_zone2:10m rate=2r/s ;

The limit_req directive enables rate limiting within the context where it appears.

Syntax: limit_req zone=name [burst=number] [nodelay | delay=number];
Default: none
Context: http, server, location

NGINX configuration sample:
limit_req zone=global_limit_zone1 burst=2 ;
limit_req zone=global_limit_zone2 burst=4 nodelay ;

The limit_req_status directive sets the status code to return in response to rejected requests.

Syntax: limit_req_status code;
Default: limit_req_status 503;
Context: http, server, location, which correspond to the nginx:server and nginx:server:site definitions of a pillar

NGINX configuration sample:
limit_req_status 429;

Example of a Salt pillar for limit_req_zone and limit_req:

nginx:
  server:
    limit_req_module:
      limit_req_zone:
        global_limit_zone1:
          key: binary_remote_addr
          size: 10m
          rate: '1r/s'
        global_limit_zone2:
          key: ip_limit_action
          size: 10m
          rate: '2r/s'
      limit_req_status: 429
      limit_req:
        global_limit_zone1:
          burst: 2
          enabled: true
        global_limit_zone2:
          burst: 4
          enabled: true
          nodelay: true

In the configuration example above, the states are kept in the 10-megabyte global_limit_zone1 and global_limit_zone2 zones. The average request processing rate cannot exceed 1 request per second for global_limit_zone1 and 2 requests per second for global_limit_zone2.

The $binary_remote_addr variable, which is a client's IP address, serves as the key for the global_limit_zone1 zone, while the mapped $ip_limit_action variable is the key for the global_limit_zone2 zone.

To apply the request limiting to a particular site, define limit_req at the site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_req_module:
          limit_req:
            global_limit_zone:
              burst: 5
              nodelay: true
              enabled: true

Configuring the connection limiting

The ngx_http_limit_conn_module module limits the number of connections per defined key. The main directives include limit_conn_zone and limit_conn.

The limit_conn_zone directive sets the parameters for a shared memory zone that keeps states for various keys. A state is the current number of connections. The key value can contain text, variables, and their combinations. Requests with an empty key value are not accounted for.

Syntax: limit_conn_zone key zone=name:size;
Default: none
Context: http

NGINX configuration sample:
limit_conn_zone $binary_remote_addr zone=global_limit_conn_zone:20m;
limit_conn_zone $binary_remote_addr zone=openstack_web_conn_zone:10m;

The limit_conn directive sets the shared memory zone and the maximum allowed number of connections for a given key value. When this limit is exceeded, the server returns an error in reply to the request.

Syntax: limit_conn zone number;
Default: none
Context: http, server, location

NGINX configuration sample:
limit_conn global_limit_conn_zone 100;
limit_conn_status 429;

Example of a Salt pillar with limit_conn_zone and limit_conn:

nginx:
  server:
    limit_conn_module:
      limit_conn_zone:
        global_limit_conn_zone:
          key: 'binary_remote_addr'
          size: 20m
          enabled: true
        api_keystone_conn_zone:
          key: 'binary_remote_addr'
          size: 10m
          enabled: true
      limit_conn:
        global_limit_conn_zone:
          connections: 100
          enabled: true
      limit_conn_status: 429

To apply the connection limiting to a particular site, define limit_conn at the site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_conn_module:
          limit_conn_status: 429
          limit_conn:
            api_keystone_conn_zone:
              connections: 50
              enabled: true
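
To roughly verify the connection limit, you can generate concurrent connections against the limited site, for example, with the ab tool from the apache2-utils package. This is only a sketch with a hypothetical endpoint; with the sample above, connections beyond the configured maximum should receive the 429 status code:

# 150 concurrent connections against a 100-connection limit
ab -n 300 -c 150 https://<proxy_vip>:5000/v3/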

Configure request limiting for custom locations

Note

This feature is available starting from the MCP 2019.2.20 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

If required, you can configure request limiting for custom locations. To do so, add the limit pillar to the location:some_location section. The limits configured for a location take precedence over the limits configured for the site. The following methods are supported:

  • By IP
  • By HTTP method (get, post, and so on)

Configuration sample:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        location:
          /some_location/:
            limit:
              enabled: true
              methods:
                ip:
                  enabled: True
                get:
                  enabled: True
                  rate: 120r/s
                  burst: 600
                  size: 20m
                  nodelay: True
                post:
                  enabled: True
                  rate: 50r/m
                  burst: 80
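
As with the other samples, apply the nginx.server state to render the configuration, then inspect the generated site files under /etc/nginx/sites-enabled/. The grep pattern below assumes the sample location name and may differ in your deployment:

salt -C 'I@nginx:server' state.sls nginx.server
salt -C 'I@nginx:server' cmd.run "grep -R -A6 'location /some_location/' /etc/nginx/sites-enabled/"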

Configure load balancing for Horizon

Starting from the Q4`18 MCP version, Horizon works in the load balancing mode by default. All requests to Horizon are terminated and forwarded to the Horizon back end by HAProxy bound to a virtual IP address. HAProxy serves as a balancer and distributes requests among the proxy nodes according to the defined policy, which is round-robin by default. This approach reduces the load on a single proxy node by spreading it among all proxy nodes.

Note

If the node that the user is connected to fails and the user is reconnected to another node, the user is logged out from the dashboard. As a result, the The user is not authorized page opens, which is the expected behavior in this use case. To continue working with the dashboard, the user has to sign in to Horizon again from the Log In page.

This section provides an instruction on how to manually configure Horizon load balancing for the existing OpenStack deployments that are based on earlier MCP release versions.

To enable active-active mode for Horizon:

  1. Log in to the Salt Master node.

  2. Update the MCP cluster to the Build ID 2019.2.0 or higher.

  3. Verify that the system.apache.server.site.horizon class has been added to your Reclass model. By default, the class is defined in the ./system/apache/server/site/horizon.yml file on the Reclass system level as follows:

    parameters:
      _param:
        apache_ssl:
          enabled: false
        apache_horizon_ssl: ${_param:apache_ssl}
        apache_horizon_api_address: ${_param:horizon_server_bind_address}
        apache_horizon_api_host: ${linux:network:fqdn}
      apache:
        server:
          bind:
            listen_default_ports: false
          enabled: true
          default_mpm: event
          modules:
            - wsgi
          site:
            horizon:
              enabled: false
              available: true
              type: wsgi
              name: openstack_web
              ssl: ${_param:apache_horizon_ssl}
              wsgi:
                daemon_process: horizon
                processes: 3
                threads: 10
                user: horizon
                group: horizon
                display_name: '%{GROUP}'
                script_alias: '/ /usr/share/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi'
                application_group: '%{GLOBAL}'
                authorization: 'On'
              limits:
                request_body: 0
              host:
                address: ${_param:apache_horizon_api_address}
                name: ${_param:apache_horizon_api_host}
                port: 8078
              locations:
                - uri: /static
                  path: /usr/share/openstack-dashboard/static
              directories:
                dashboard_static:
                  path: /usr/share/openstack-dashboard/static
                  order: 'allow,deny'
                  allow: 'from all'
                  modules:
                    mod_expires.c:
                      ExpiresActive: 'On'
                      ExpiresDefault: '"access 6 month"'
                    mod_deflate.c:
                      SetOutputFilter: 'DEFLATE'
                dashboard_wsgi:
                  path: /usr/share/openstack-dashboard/openstack_dashboard/wsgi
                  order: 'allow,deny'
                  allow: 'from all'
              log:
                custom:
                  format: >-
                    %v:%p %{X-Forwarded-For}i %h %l %u %t \"%r\" %>s %D %O \"%{Referer}i\" \"%{User-Agent}i\"
                error:
                  enabled: true
                  level: debug
                  format: '%M'
                  file: '/var/log/apache2/openstack_dashboard_error.log'
    
  4. Verify that the system.apache.server.site.horizon has been added to the Reclass system level in the ./system/horizon/server/single.yml file as follows:

    classes:
      - service.horizon.server.single
      - system.horizon.upgrade
      - system.horizon.server.iptables
      - system.apache.server.single
      - system.memcached.server.single
      - system.apache.server.site.horizon
    
  5. Verify that the definition for the system.haproxy.proxy.listen.openstack.openstack_web class has been added to the Reclass cluster level in the proxy nodes configuration file:

    parameters:
      _param:
        haproxy_openstack_web_check_params: check
      haproxy:
        proxy:
          listen:
            openstack_web:
              type: custom
              check: false
              sticks: ${_param:haproxy_openstack_web_sticks_params}
              binds:
              - address: ${_param:cluster_vip_address}
                port: ${_param:haproxy_openstack_web_bind_port}
              servers:
              - name: ${_param:cluster_node01_hostname}
                host: ${_param:cluster_node01_address}
                port: 8078
                params: ${_param:haproxy_openstack_web_check_params}
              - name: ${_param:cluster_node02_hostname}
                host: ${_param:cluster_node02_address}
                port: 8078
                params: ${_param:haproxy_openstack_web_check_params}
    
  6. Add the system.haproxy.proxy.listen.openstack.openstack_web class to the Horizon node configuration file, for example, cluster/<cluster_name>/openstack/dashboard.yml:

    classes:
      - system.haproxy.proxy.listen.openstack.openstack_web
    
  7. In the Horizon node configuration file (edited in the previous step), define the host names and IP addresses for all proxy nodes used in the deployment for the dashboard node and verify that the HAProxy checks the availability of Horizon.

    Configuration example for two proxy nodes:

    parameters:
      _param:
        cluster_node01_hostname: ${_param:openstack_proxy_node01_hostname}
        cluster_node01_address: ${_param:openstack_proxy_node01_address}
        cluster_node02_hostname: ${_param:openstack_proxy_node02_hostname}
        cluster_node02_address: ${_param:openstack_proxy_node02_address}
        haproxy_openstack_web_bind_port: ${_param:horizon_public_port}
        haproxy_openstack_web_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      horizon:
        server:
          cache:
            ~members:
            - host: ${_param:openstack_proxy_node01_address}
              port: 11211
            - host: ${_param:openstack_proxy_node02_address}
              port: 11211
    
  8. If HTTP to HTTPS redirection is to be used, add the following configuration to the Horizon node configuration file:

    parameters:
      haproxy:
        proxy:
          listen:
            openstack_web_proxy:
              mode: http
              format: end
              force_ssl: true
              binds:
              - address: ${_param:cluster_vip_address}
                port: 80
    
  9. Disable the NGINX servers requests for Horizon by replacing the NGINX class with the HAProxy class in the proxy node configuration file.

    Replace:

    - system.nginx.server.proxy.openstack_web
    

    with

    - system.haproxy.proxy.single
    
  10. Remove the nginx_redirect_openstack_web_redirect.conf and nginx_proxy_openstack_web.conf Horizon sites from /etc/nginx/sites-enabled/.

  11. Restart the NGINX service on the proxy nodes:

    salt 'prx*' cmd.run 'systemctl restart nginx'
    
  12. Verify that Keepalived keeps track of HAProxy by adding the haproxy value to the keepalived_vrrp_script_check_multiple_processes parameter:

    parameters:
      _param:
        keepalived_vrrp_script_check_multiple_processes: 'nginx haproxy'
    
  13. Enable SSL for Horizon:

    parameters:
      _param:
        apache_ssl:
          enabled: true
          authority: ${_param:salt_minion_ca_authority}
          engine: salt
          key_file:  /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}.key
          cert_file: /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}.crt
          chain_file: /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}-with-chain.crt
    
  14. Define the address to be bound by Memcached in the cluster/<cluster_name>/openstack/proxy.yml file:

    parameters:
      _param:
        openstack_memcached_server_bind_address: ${_param:single_address}
    
  15. Verify that the Horizon Salt formula is updated to a version higher than 2016.12.1+201812072002.e40b950 and the Apache Salt formula is updated to a version higher than 0.2+201811301717.acb3391.

  16. Delete the NGINX sites from the proxy nodes that proxy Horizon requests and possible redirection from HTTP to HTTPS.

  17. Apply the haproxy and horizon states on the proxy nodes:

    salt -C 'I@horizon:server' state.sls horizon
    salt -C 'I@horizon:server' state.sls haproxy
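
    To verify the result, you can check that HAProxy on the proxy nodes serves the openstack_web listener and that Horizon responds on the VIP. A minimal sketch, assuming HTTPS is enabled and <cluster_vip_address> is your proxy VIP:

      salt 'prx*' cmd.run "grep -A12 'listen openstack_web' /etc/haproxy/haproxy.cfg"
      curl -k -I https://<cluster_vip_address>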
    

Expose a hardware RNG device to Nova instances

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

MCP enables you to define the path to a Random Number Generator (RNG) device that will be used as the source of entropy on the host. The default source of entropy is /dev/urandom. Other available options include /dev/random and /dev/hwrng.

The example structure of the RNG definition in the Nova pillar:

nova:
  controller:
    libvirt:
      rng_dev_path: /dev/random

  compute:
    libvirt:
      rng_dev_path: /dev/random

The procedure included in this section can be used for both existing and new MCP deployments.

To define the path to an RNG device:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/openstack/control.yml, define the rng_dev_path parameter for nova:controller:

    nova:
      controller:
        libvirt:
          rng_dev_path: /dev/random
    
  3. In classes/cluster/<cluster_name>/openstack/compute/init.yml, define the rng_dev_path parameter for nova:compute:

    nova:
      compute:
        libvirt:
          rng_dev_path: /dev/random
    
  4. Apply the changes:

    salt -C 'I@nova:controller' state.sls nova.controller
    salt -C 'I@nova:compute' state.sls nova.compute
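
    Optionally, verify that the option is rendered into the Nova configuration. A minimal check, assuming the libvirt options are written to /etc/nova/nova.conf on the compute nodes:

      salt -C 'I@nova:compute' cmd.run 'grep rng_dev_path /etc/nova/nova.conf'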
    

Set the directory for lock files

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can set the directory for lock files for the Ceilometer, Cinder, Designate, Glance, Ironic, Neutron, and Nova OpenStack services by specifying the lock_path parameter in the Reclass model. This section provides the example of the lock path configuration for Nova.

To set the lock path for Nova:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Define the lock_path parameter:

    1. In openstack/control.yml, specify:

      parameters:
        nova:
          controller:
            concurrency:
              lock_path: '/var/lib/nova/tmp'
      
    2. In openstack/compute.yml, specify:

      parameters:
        nova:
          compute:
            concurrency:
              lock_path: '/var/lib/nova/tmp'
      
  3. Apply the changes from the Salt Master node:

    salt -C 'I@nova:controller or I@nova:compute' saltutil.refresh_pillar
    salt -C 'I@nova:controller' state.apply nova.controller
    salt -C 'I@nova:compute' state.apply nova.compute
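
    Optionally, verify that the option is rendered into the Nova configuration on the controller and compute nodes:

      salt -C 'I@nova:controller or I@nova:compute' cmd.run 'grep lock_path /etc/nova/nova.conf'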
    

Add the Nova CpuFlagsFilter custom filter

Note

This feature is available starting from the MCP 2019.2.10 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

CpuFlagsFilter is a custom Nova scheduler filter for live migrations. The filter ensures that the CPU features of a live migration source host match the target host. Use the CpuFlagsFilter filter only if your deployment meets the following criteria:

  • The CPU mode is set to host-passthrough or host-model. For details, see MCP Deployment Guide: Configure a CPU model.
  • The OpenStack compute nodes have heterogeneous CPUs.
  • The OpenStack compute nodes are not organized in aggregates with the same CPU in each aggregate.

To add the Nova CpuFlagsFilter custom filter:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Open the classes/cluster/<cluster_name>/openstack/control.yml file for editing.

  3. Verify that the cpu_mode parameter is set to host-passthrough or host-model.

  4. Add CpuFlagsFilter to the scheduler_default_filters parameter for nova:controller:

    nova:
      controller:
        scheduler_default_filters: "DifferentHostFilter,SameHostFilter,RetryFilter,AvailabilityZoneFilter,RamFilter,CoreFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,NUMATopologyFilter,AggregateInstanceExtraSpecsFilter,CpuFlagsFilter"
    
  5. Log in to the Salt Master node.

  6. Apply the changes:

    salt -C 'I@nova:controller' state.sls nova.controller
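
    Optionally, verify that the filter list is rendered into the Nova scheduler configuration. Depending on the OpenStack release, the option may be named scheduler_default_filters or enabled_filters, so the check below greps for both:

      salt -C 'I@nova:controller' cmd.run "grep -E 'scheduler_default_filters|enabled_filters' /etc/nova/nova.conf"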
    

Disable Nova cell mapping

Note

This feature is available starting from the MCP 2019.2.16 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You may need to disable cell mapping and database migrations for the OpenStack Compute service (Nova), for example, to redeploy the nova state faster.

Note

Enable Nova cell mapping before performing any update or upgrade.

To disable Nova cell mapping:

  1. In defaults/openstack/init.yml on the system Reclass level, set the nova_control_update_cells parameter to False:

    _param:
      nova_control_update_cells: False
    
  2. Log in to the Salt Master node.

  3. Refresh cache, synchronize Salt pillars and resources:

    salt '*' saltutil.clear_cache && \
    salt '*' saltutil.refresh_pillar && \
    salt '*' saltutil.sync_all
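
    Optionally, confirm that the new value is visible in the pillar data of the controller nodes:

      salt -C 'I@nova:controller' pillar.get _param:nova_control_update_cells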
    

Clean up an OpenStack database

Note

This feature is available starting from the MCP 2019.2.12 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Using the Deploy - Openstack Database Cleanup Jenkins pipeline job, you can automatically clean up stale records from the Nova, Cinder, Heat, or Glance database to make it smaller. This is helpful before any update or upgrade activity. You can execute the Deploy - Openstack Database Cleanup Jenkins pipeline job without a maintenance window, just as for an online dbsync.

To clean up an OpenStack database:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In <cluster_name>/openstack/control.yml, specify the pillars below with the following parameters:

    • Set db_purge to True.
    • Set days as required. If you skip setting the days parameter:
      • For Nova and Heat, all stale records will be archived or purged.
      • For Cinder on OpenStack Pike, records older than 1 day will be deleted. For Cinder on OpenStack Queens, all entries will be deleted.
      • For Glance, all entries will be deleted.
    • For Nova, set max_rows to limit the number of rows to delete. It is safe to specify 500-1000 rows. An unlimited cleanup may cause the database to become inaccessible and, as a result, OpenStack to become inoperable.

    For example:

    • For Nova:

      nova:
        controller:
          db_purge:
            enabled: True
            max_rows: 1000
            days: 45
      
    • For Cinder:

      cinder:
        controller:
          db_purge:
            enabled: True
            days: 45
      
    • For Heat:

      heat:
        server:
          db_purge:
            enabled: True
            days: 45
      
    • For Glance (available since the 2019.2.17 maintenance update):

      glance:
        server:
          db_purge:
            enabled: True
            days: 45
      
  3. Log in to the Jenkins web UI.

  4. Open the Deploy - Openstack Database Cleanup Jenkins pipeline job.

  5. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
  6. Click Deploy.

The Jenkins pipeline job workflow:

  1. For Nova, move the deleted rows from the production tables to shadow tables. Data from shadow tables is purged to save disk space.
  2. For Cinder, purge the database entries that are marked as deleted.
  3. For Heat, purge the database entries that are marked as deleted.
  4. For Glance, purge the database entries that are marked as deleted.
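
For reference, the cleanup performed by the pipeline job corresponds to the standard OpenStack database maintenance commands. The following is a sketch of roughly equivalent manual commands that can be run on a controller node; the parameter values are illustrative and should match your db_purge pillar settings:

# Nova: archive deleted rows in batches of 1000 until done
nova-manage db archive_deleted_rows --max_rows 1000 --until-complete
# Cinder: purge records marked as deleted that are older than 45 days
cinder-manage db purge 45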

Galera operations

This section describes the Galera service operations you may need to perform after the deployment of an MCP cluster.

Configure arbitrary Galera parameters

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to configure arbitrary Galera parameters using the wsrep_provider_options Galera variable. For details, see Galera parameters.

To configure arbitrary Galera parameters:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. In openstack/database/init.yml, specify wsrep_provider_options as required. For example:

    parameters:
      ...
      galera:
        master:
          wsrep_provider_options:
            gcomm.thread_prio: "rr:2"
        slave:
          wsrep_provider_options:
            gcomm.thread_prio: "rr:2"
      ...
    
  3. Log in to the Salt Master node.

  4. Apply the changes to Galera:

    salt -C 'I@galera:master or I@galera:slave' saltutil.refresh_pillar
    salt -C 'I@galera:master or I@galera:slave' state.apply galera
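
    Optionally, verify that the pillar is in place before or after applying the state:

      salt -C 'I@galera:master' pillar.get galera:master:wsrep_provider_options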
    

Enable Cinder coordination

Note

This feature is available starting from the MCP 2019.2.16 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

To avoid race conditions in the OpenStack Block Storage service (Cinder) with an active/active configuration, you can use a coordination manager system with MySQL as a back end. For example, it can prevent deletion of a volume that is being used to create another volume, or prevent attaching a volume that is already being attached.

To enable Cinder coordination:

  1. From the Salt Master node, verify that the salt-formula-cinder package version is 2016.12.1+202108101137.19b6edd~xenial1_all or later.

  2. From an OpenStack controller node, verify that the python-tooz package version is 1.60.2-1.0~u16.04+mcp4 or later.

  3. Open the cluster level of your deployment model.

  4. In <cluster_name>/openstack/control.yml, specify the following configuration:

    cinder:
      controller:
        coordination:
          enabled: true
          backend: mysql
    
  5. Log in to the Salt Master node.

  6. Apply the changes:

    salt -C 'I@cinder:controller' state.sls cinder
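
    Optionally, verify that the coordination back end is rendered into the Cinder configuration. A minimal check, assuming the [coordination] section is written to /etc/cinder/cinder.conf:

      salt -C 'I@cinder:controller' cmd.run "grep -A2 '\[coordination\]' /etc/cinder/cinder.conf"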
    

Kubernetes operations

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

This section includes topics that describe operations with your Kubernetes environment.

Monitor connectivity between the Kubernetes nodes using Netchecker

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

The Mirantis Cloud Platform automatically deploys Netchecker as part of an MCP Kubernetes Calico-based deployment. Netchecker enables network connectivity and network latency monitoring for the Kubernetes nodes.

This section includes topics that describe how to configure and use Netchecker.

View Netchecker metrics

MCP automatically configures Netchecker during the deployment of the Kubernetes cluster. Therefore, Netchecker starts gathering metrics as soon as the Kubernetes cluster is up and running. You can view Netchecker metrics to troubleshoot connectivity between the Kubernetes nodes.

To view Netchecker metrics:

  1. Log in to the Kubernetes Master node.

  2. Obtain the IP address of the Netchecker server pod:

    kubectl get pod -o json --selector='app==netchecker-server' -n \
    netchecker | grep podIP
    
  3. Obtain the Netchecker container port number:

    kubectl get pod -o json --selector='app==netchecker-server' -n \
    netchecker | grep containerPort
    
  4. View all metrics provided by Netchecker:

    curl <netchecker-pod-ip>:<port>/metrics
    
  5. View the list of Netchecker agents metrics:

    curl <netchecker-pod-ip>:<port>/metrics | grep ncagent
    

    Example of system response:

    # HELP ncagent_error_count_total Total number of errors (keepalive miss
    # count) for the agent.
    # TYPE ncagent_error_count_total counter
    ncagent_error_count_total{agent="cmp01-private_network"} 0
    ncagent_error_count_total{agent="cmp02-private_network"} 0
    ncagent_error_count_total{agent="ctl01-private_network"} 0
    ncagent_error_count_total{agent="ctl02-private_network"} 0
    ncagent_error_count_total{agent="ctl03-private_network"} 0
    
    ...
    

    For the list of Netchecker metrics, see: Netchecker metrics description.

Netchecker metrics description

The following table lists Netchecker metrics. The metrics with the ncagent_ prefix are used to monitor the Kubernetes environment.

Netchecker metrics
Metric Description
go_* A set of default golang metrics provided by the Prometheus library.
process_* A set of default process metrics provided by the Prometheus library.
ncagent_report_count_total (label agent) A counter that calculates the number of total reports from every Netchecker agent separated by label.
ncagent_error_count_total (label agent) A counter that calculates the number of total errors from every agent separated by label. Netchecker increases the value of the counter each time the Netchecker agent fails to send a report within the reporting_interval * 2 timeframe.
ncagent_http_probe_connection_result A gauge that represents the connection result between an HTTP server and a Netchecker agent. Possible values: 0 - error, 1 - success.
ncagent_http_probe_code A gauge that represents the HTTP status code. Returns 0 if there is no HTTP response.
ncagent_http_probe_total_time_ms A gauge that represents the total duration of an HTTP transaction.
ncagent_http_probe_content_transfer_time_ms A gauge that represents the duration of content transfer from the first response byte till the end (in ms).
ncagent_http_probe_tcp_connection_time_ms A gauge that represents the TCP connection establishing time in ms.
ncagent_http_probe_dns_lookup_time_ms A gauge that represents the DNS lookup time in ms.
ncagent_http_probe_connect_time_ms A gauge that represents connection time in ms.
ncagent_http_probe_server_processing_time_ms A gauge that represents the server processing time in ms.

Transition to containers

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

Transitioning from virtual machines to containers is a lengthy and complex process that in some environments may take years. If you want to leverage Kubernetes features while you continue using the existing applications that run in virtual machines, Mirantis provides an interim solution that enables running virtual machines orchestrated by Kubernetes.

To enable Kubernetes to run virtual machines, you need to deploy and configure a virtual machine runtime for Kubernetes called Virtlet. Virtlet is a Kubernetes Container Runtime Interface (CRI) implementation that is packaged as a Docker image and contains such components as libvirt daemon, QEMU/KVM wrapper, and so on.

Virtlet enables you to run unmodified QEMU/KVM virtual machines that do not include an additional containerd layer as in similar solutions in Kubernetes. Virtlet supports all standard Kubernetes objects, such as ReplicaSets, deployments, DaemonSets, and so on, as well as their operations. For information on operations with Kubernetes objects, see: Kubernetes documentation.

Unmodified QEMU/KVM virtual machines enable you to run:

  • Unikernels
  • Applications that are hard to containerize
  • NFV workloads
  • Legacy applications

Compared to regular Kubernetes pods, Virtlet pods have the following limitations:

  • Only one virtual machine per pod is allowed.
  • Virtual machine volumes (pod volumes) must be specified using the FlexVolume driver. Standard Kubernetes directory-based volumes are not supported except for the use case of Kubernetes Secrets and ConfigMaps. If a Secret or a ConfigMap is mounted to a VM pod, its content is copied into an appropriate location inside the VM using the cloud-init mechanism.
  • No support for kubectl exec.

For details on Virtlet operations, see: Virtlet documentation.

This section describes how to create and configure Virtlet pods as well as provides examples of pods for different services.

For an instruction on how to update Virtlet, see Update Virtlet.

Prerequisites

To be able to run virtual machines as Kubernetes pods, your environment must meet the following prerequisites:

  • An operational Kubernetes environment with enabled Virtlet functionality.
  • SELinux and AppArmor must be disabled on the Kubernetes nodes.
  • The Kubernetes node names must be resolvable by the DNS server configured on the Kubernetes nodes.

Example of a pod configuration

You need to define a pod for each virtual machine that you want to place under Kubernetes orchestration.

Pods are defined as .yaml files. The following text is an example of a pod configuration for a VM with Virtlet:

apiVersion: v1
kind: Pod
metadata:
  name: cirros-vm
  annotations:
    kubernetes.io/target-runtime: virtlet
    VirtletVCPUCount: "1"
    VirtletSSHKeys: |
      ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCaJEcFDXEK2Zb
      X0ZLS1EIYFZRbDAcRfuVjpstSc0De8+sV1aiu+dePxdkuDRwqFt
      Cyk6dEZkssjOkBXtri00MECLkir6FcH3kKOJtbJ6vy3uaJc9w1E
      Ro+wyl6SkAh/+JTJkp7QRXj8oylW5E20LsbnA/dIwWzAF51PPwF
      7A7FtNg9DnwPqMkxFo1Th/buOMKbP5ZA1mmNNtmzbMpMfJATvVy
      iv3ccsSJKOiyQr6UG+j7sc/7jMVz5Xk34Vd0l8GwcB0334MchHc
      kmqDB142h/NCWTr8oLakDNvkfC1YneAfAO41hDkUbxPtVBG5M/o
      7P4fxoqiHEX+ZLfRxDtHB53 me@localhost
    VirtletCloudInitUserDataScript: |
      #!/bin/sh
      echo "Hi there"
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: cirros-vm
    image: virtlet/download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img
    resources:
      limits:
        memory: 128Mi

The following table describes the pod configuration parameters:

Pod definition parameters
Parameter Description
apiVersion Version of the Kubernetes API.
kind Type of the file. For all pod configurations, the kind parameter value is Pod.
metadata

Specifies a number of parameters required for the pod configuration, including:

  • name - the name of the pod.
  • annotations - a subset of metadata parameters in the form of strings. Numeric values must be quoted:
    • kubernetes.io/target-runtime - defines that this pod belongs to the Virtlet runtime.
    • VirtletVCPUCount - (optional) specifies the number of virtual CPUs. The default value is 1.
    • VirtletSSHKeys - one or many SSH keys, one key per line.
    • VirtletCloudInitUserDataScript - user data for the cloud-init script.
spec

Pod specification, including:

  • nodeAffinity - the specification in the example above ensures that Kubernetes runs this pod only on the nodes that have the extraRuntime=virtlet label. This label is used by the Virtlet DaemonSet to select nodes that must have the Virtlet runtime.
  • containers - a container configuration that includes:
    • name - name of the container.
    • image - specifies the path to a network location where the Docker image is stored. The path must start with the virtlet prefix followed by the URL to the required location.
    • resources - defines the resources, such as memory limitation for the libvirt domain.
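
To create the VM pod from a definition like the one above, save it to a file and pass it to kubectl. A minimal sketch, assuming the definition is saved as cirros-vm.yaml on a Kubernetes Master node:

kubectl create -f cirros-vm.yaml
kubectl get pod cirros-vm
# Attach to the VM console once the pod is running
kubectl attach -it cirros-vm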

Example of a pod definition for an ephemeral device

Virtlet stores all ephemeral volumes in the local libvirt storage pool in /var/lib/virtlet/volumes. The volume configuration is defined in the volumes section of the pod definition.

The following text is an example of a pod (virtual machine) definition with an ephemeral volume of 2048 MB capacity.

apiVersion: v1
kind: Pod
metadata:
  name: test-vm-pod
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: test-vm
    image: download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
    volumeMounts:
    - name: containerd
      mountPath: /var/lib/containerd
  volumes:
  - name: containerd
    flexVolume:
      driver: "virtlet/flexvolume_driver"
      options:
        type: qcow2
        capacity: 2048MB

Example of a pod definition for a Ceph RBD

If your virtual machines store data in Ceph, you can attach Ceph RBDs to the virtual machines under Kubernetes control by specifying the required RBDs in the virtual machine pod definition. You do not need to mount the devices in the container.

Virtlet supports the following options for Ceph RBD devices:

Option Parameter
FlexVolume driver kubernetes.io/flexvolume_driver
Type ceph
Monitor ip:port
User user-name
Secret user-secret-key
Volume rbd-image-name
Pool pool-name

The following text is an example of a virtual machine pod definition with one Ceph RBD volume:

apiVersion: v1
kind: Pod
metadata:
  name: cirros-vm-rbd
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: cirros-vm-rbd
    image: virtlet/image-service.kube-system/cirros
    volumeMounts:
    - name: test
      mountPath: /testvol
  volumes:
  - name: test
    flexVolume:
      driver: kubernetes.io/flexvolume_driver
      options:
        Type: ceph
        Monitor: 10.192.0.1:6789
        User: libvirt
        Secret: AQDTwuVY8rA8HxAAthwOKaQPr0hRc7kCmR/9Qg==
        Volume: rbd-test-image
        Pool: libvirt-pool

Example of a pod definition for a block device

If you want to mount a block device that is available in the /dev/ directory on the Kubernetes node, you can specify the raw device in the pod definition.

The following text is an example of a pod definition with one block device. In this example, the path to the raw device is /dev/loop0, which means that the disk is associated with this path on the Virtlet node.

apiVersion: v1
kind: Pod
metadata:
  name: test-vm-pod
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: test-vm
    image: download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
    volumeMounts:
    - name: raw
      mountPath: /rawvol
  volumes:
  - name: raw
    flexVolume:
      driver: "virtlet/flexvolume_driver"
      options:
        type: raw
        path: /dev/loop0

Reprovision the Kubernetes Master node

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

If the Kubernetes Master node became non-operational and recovery is not possible, you can reprovision the node from scratch.

When reprovisioning a node, you cannot update some of the configuration data:

  • Hostname and FQDN - because it breaks Calico.
  • Node role - for example, from the Kubernetes Master to Node role. However, you can use the kubectl label node command to reset the node labels later.
  • Network plugin - for example, from Calico to Weave.

You can change the following information:

  • Host IP(s)
  • MAC addresses
  • Operating system
  • Application certificates

Caution

All Master nodes must serve the same apiserver certificate. Otherwise, service tokens will become invalidated.

To reprovision the Kubernetes Master node:

  1. Verify that MAAS works properly and provides the DHCP service to assign an IP address and bootstrap an instance.

  2. Verify that the target nodes have connectivity with the Salt Master node:

    salt 'ctl[<NUM>]*' test.ping
    
  3. Update modules and states on the new Minion of the Salt Master node:

    salt 'ctl[<NUM>]*' saltutil.sync_all
    

    Note

    The ctl[<NUM>] parameter is the ID of a failed Kubernetes Master node.

  4. Create and distribute SSL certificates for services using the salt state:

    salt 'ctl[<NUM>]*' state.sls salt
    
  5. Install Keepalived:

    salt 'ctl[<NUM>]*' state.sls keepalived -b 1
    
  6. Install HAProxy and verify its status:

    salt 'ctl[<NUM>]*' state.sls haproxy
    salt 'ctl[<NUM>]*' service.status haproxy
    
  7. Install etcd and verify the cluster health:

    salt 'ctl[<NUM>]*' state.sls etcd.server.service
    salt 'ctl[<NUM>]*' cmd.run "etcdctl cluster-health"
    

    Install etcd with the SSL support:

    salt 'ctl[<NUM>]*' state.sls salt.minion.cert,etcd.server.service
    salt 'ctl[<NUM>]*' cmd.run '. /var/lib/etcd/configenv && etcdctl cluster-health'
    
  8. Install Kubernetes:

    salt 'ctl[<NUM>]*' state.sls kubernetes.master.kube-addons
    salt 'ctl[<NUM>]*' state.sls kubernetes.pool
    
  9. Set up NAT for Calico:

    salt 'ctl[<NUM>]*' state.sls etcd.server.setup
    
  10. Apply the kubernetes states, excluding kubernetes.master.setup, to check consistency:

    salt 'ctl[<NUM>]*' state.sls kubernetes exclude=kubernetes.master.setup
    
  11. Register add-ons:

    salt 'ctl[<NUM>]*' --subset 1 state.sls kubernetes.master.setup
    

Kubernetes Nodes operations

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

This section contains the Kubernetes Nodes-related operations.

Add a Kubernetes Node automatically

MCP DriveTrain enables you to automatically scale up the number of Nodes in your MCP Kubernetes cluster if required.

Note

Currently, only scaling up is supported using the Jenkins pipeline job. However, you can scale down the number of Kubernetes Nodes manually as described in Remove a Kubernetes Node.

To scale up a Kubernetes cluster:

  1. Log in to the Jenkins web UI as an administrator.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  2. Find the deployment pipeline job that you used to successfully deploy your Kubernetes cluster. For deployment details, refer to the Deploy a Kubernetes cluster procedure. You will reuse the existing deployment pipeline job to scale up your existing Kubernetes cluster.

  3. Select the Build with Parameters option from the drop-down menu of the pipeline job.

  4. Reconfigure the following parameters:

    Kubernetes parameters to scale up the existing cluster
    Parameter Description
    STACK_COMPUTE_COUNT The number of Kubernetes Nodes to be deployed by the pipeline job. Configure as required for your use case.
    STACK_NAME The Heat stack name to reuse.
    STACK_REUSE Select to reuse the existing Kubernetes deployment that requires scaling up.
  5. Click Build to launch the pipeline job.

As a result of the deployment pipeline job execution, your existing Kubernetes cluster will be scaled up during the Scale Kubernetes Nodes stage as configured. The preceding stages of the workflow will be executed as well to ensure proper configuration. However, it will take a significantly shorter period of time to execute these stages, as most of the operations have been already performed during the initial cluster deployment.

Add a Kubernetes Node manually

This section describes how to manually add a Kubernetes Node to your MCP cluster to increase the cluster capacity, for example.

To add a Kubernetes Node manually:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Log in to the Salt Master node.

  3. Verify that salt-minion is running on the target node and this node appears in the list of the Salt keys:

    salt-key
    

    Example of system response:

    cmp0.bud.mirantis.net
    cmp1.bud.mirantis.net
    cmp2.bud.mirantis.net
    
  4. Apply the Salt states to the target node. For example, to cmp2:

    salt 'cmp2*' saltutil.refresh_pillar
    salt 'cmp2*' saltutil.sync_all
    salt 'cmp2*' state.apply salt
    salt 'cmp2*' state.apply linux,ntp,openssh,git
    salt 'cmp2*' state.sls kubernetes.pool
    salt 'cmp2*' service.restart 'kubelet'
    salt 'cmp2*' state.apply salt
    salt '*' state.apply linux.network.host
    
  5. If Virtlet will run on the target node, add the node label:

    salt -C 'I@kubernetes:master and 01' \
    cmd.run 'kubectl label --overwrite node cmp2 extraRuntime=virtlet'
    
  6. Log in to any Kubernetes Master node.

  7. Verify that the target node appears in the list of the cluster nodes and is in the Ready state:

    kubectl get nodes
    

    Example of system response:

    NAME STATUS ROLES AGE  VERSION
    cmp0 Ready  node  54m  v1.10.3-3+93532daa6d674c
    cmp1 Ready  node  54m  v1.10.3-3+93532daa6d674c
    cmp2 Ready  node  2m   v1.10.3-3+93532daa6d674c
    

Remove a Kubernetes Node

This section describes how to remove a Kubernetes Node from your MCP cluster.

To remove a Kubernetes Node:

  1. Log in to the Kubernetes Node that you want to remove.

  2. Stop and disable the salt-minion service on this node:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  3. Log in to the Salt Master node.

  4. Verify that the node name is not registered in salt-key. If the node is present, remove it:

    salt-key | grep <node_name><NUM>
    salt-key -d <node_name><NUM>.domain_name
    
  5. Log in to any Kubernetes Master node.

  6. Mark the Node as unschedulable to prevent new pods from being assigned to it:

    kubectl cordon <node_ID>
    kubectl drain <node_ID>
    
  7. Remove the Kubernetes Node:

    kubectl delete node cmp<node_ID>
    

    Wait until the workloads are gracefully deleted and the Kubernetes Node is removed.

  8. Verify that the node is absent in the Kubernetes Nodes list:

    kubectl get nodes
    
  9. Open your Git project repository with Reclass model on the cluster level.

  10. In infra/config.yml, remove the definition of the Kubernetes Node in question under the reclass:storage:node pillar.

  11. Log in to the Kubernetes Node in question.

  12. Run the following commands:

    systemctl stop kubelet
    systemctl disable kubelet
    

Reprovision a Kubernetes Node

You may need to reprovision a failed Kubernetes Node. When reprovisioning a Kubernetes Node, you cannot update some of the configuration data:

  • Hostname and FQDN - because it breaks Calico.
  • Node role - for example, from the Kubernetes Master to Node role. However, you can use the kubectl label node command to reset the node labels later.

You can change the following information:

  • Host IP(s)
  • MAC addresses
  • Operating system
  • Application certificates

Caution

All Master nodes must serve the same apiserver certificate. Otherwise, service tokens will become invalidated.

To reprovision a Kubernetes Node:

  1. In the MAAS web UI, make the required changes to the target Kubernetes Node.
  2. Verify that MAAS works properly and provides the DHCP service to assign IP addresses and bootstrap an instance.
  3. Proceed with the Add a Kubernetes Node manually procedure starting from step 2.

Use the role-based access control (RBAC)

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

After you enable the role-based access control (RBAC) on your Kubernetes cluster as described in Deployment Guide: Enable RBAC, you can start controlling system access to authorized users by creating, changing, or restricting user or services roles as required. Use the kubernetes.control.role state to orchestrate the role and role binding.

The following example illustrates a configuration of a brand-new role and role binding for a service account:

control:
  role:
    etcd-operator:
      kind: ClusterRole
      rules:
        - apiGroups:
            - etcd.coreos.com
          resources:
            - clusters
          verbs:
            - "*"
        - apiGroups:
            - extensions
          resources:
            - thirdpartyresources
          verbs:
            - create
        - apiGroups:
            - storage.k8s.io
          resources:
            - storageclasses
          verbs:
            - create
        - apiGroups:
            - ""
          resources:
            - replicasets
          verbs:
            - "*"
      binding:
        etcd-operator:
          kind: ClusterRoleBinding
          namespace: test # <-- if no namespace, then it is ClusterRoleBinding
          subject:
            etcd-operator:
              kind: ServiceAccount

The following example illustrates a configuration of the test edit permissions for a User in the test namespace:

kubernetes:
  control:
    role:
      edit:
        kind: ClusterRole
        # No rules defined, so only binding will be created assuming role
        # already exists.
        binding:
          test:
            namespace: test
            subject:
              test:
                kind: User
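
To apply the role definitions, run the kubernetes.control.role state referenced above from the Salt Master node and verify the created objects with kubectl. The object name below follows the first example and is an assumption:

salt -C 'I@kubernetes:master' state.sls kubernetes.control.role
kubectl get clusterrole etcd-operator
kubectl get clusterrolebinding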

OpenContrail operations

This section describes how to configure and use the OpenContrail-related features. To troubleshoot the OpenContrail services, refer to Troubleshoot OpenContrail.

Verify the OpenContrail status

To ensure that OpenContrail is up and running, verify the status of all the OpenContrail-related services. If any service fails, restart it as described in Restart the OpenContrail services.

For OpenContrail 4.x

  1. Log in to the Salt Master node.

  2. Apply one of the following states depending on the Build ID of your MCP cluster:

    • For MCP Build ID 2018.11.0 or later:

      salt -C 'ntw* or nal*' state.sls opencontrail.upgrade.verify
      

      If the state is applied successfully, it means that all OpenContrail services are up and running.

    • For MCP Build ID 2018.8.0 or 2018.8.0-milestone1:

      salt -C 'ntw* or nal*' cmd.run 'doctrail all contrail-status'
      

      In the output, all services must be either in active or backup (for example, for contrail-schema, contrail-svc-monitor, contrail-device-manager services) state.

For OpenContrail 3.2

Eventually, all services should be active except for contrail-device-manager, contrail-schema, and contrail-svc-monitor. These services are active on only one OpenContrail controller (ntw) node in the cluster and switch dynamically between the nodes in case of a failure. On the two other OpenContrail controller nodes, these services are in the backup state.

To verify the OpenContrail services status, apply the following state for the OpenContrail ntw and nal nodes from the Salt Master node:

salt -C 'ntw* or nal*' cmd.run 'contrail-status'

Example of system response:

== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      active
contrail-dns                  active
contrail-named                active

== Contrail Analytics ==
supervisor-analytics:         active
contrail-analytics-api        active
contrail-analytics-nodemgr    active
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       active
contrail-topology             active

== Contrail Config ==
supervisor-config:            active
contrail-api:0                active
contrail-config-nodemgr       active
contrail-device-manager       initializing
contrail-discovery:0          active
contrail-schema               initializing
contrail-svc-monitor          initializing
ifmap                         active

== Contrail Web UI ==
supervisor-webui:             active
contrail-webui                active
contrail-webui-middleware     active

== Contrail Database ==
supervisor-database:          active
contrail-database             active
contrail-database-nodemgr     active

== Contrail Support Services ==
supervisor-support-service:   active
rabbitmq-server               active

Restart the OpenContrail services

You may need to restart an OpenContrail service, for example, during an MCP cluster update or upgrade when a service failure is caused by the asynchronous restart order of the OpenContrail services after the kvm nodes update or reboot.

All OpenContrail 4.x services run as the systemd services in a Docker container.

All OpenContrail 3.2 services are managed by the process supervisord. The supervisord daemon is automatically installed with the OpenContrail packages including the following OpenContrail Supervisor groups of services:

  • supervisor-database
  • supervisor-config
  • supervisor-analytics
  • supervisor-control
  • supervisor-webui

To restart the OpenContrail 4.x services:

  1. Log in to the Salt Master node.

  2. Restart the required service on the corresponding OpenContrail node using the following example:

    salt 'ntw03' cmd.run 'doctrail controller service contrail-api restart'
    

    Note

    For a list of OpenContrail containers names to be used by the doctrail utility, see: The doctrail utility for the OpenContrail containers in OpenStack.

  3. If restarting of a service in question does not change its failed status, proceed to further troubleshooting as described in Troubleshoot OpenContrail. For example, to troubleshoot Cassandra not starting, refer to Troubleshoot Cassandra for OpenContrail 4.x.

To restart the OpenContrail 3.2 services:

  1. Log in to the required OpenContrail node.

  2. Select from the following options:

    • To restart a whole Supervisor group of the OpenContrail services, use the service <supervisor_group_name> restart command. For example:

      service supervisor-control restart
      
    • To restart individual services inside the Supervisor group, use the service <supervisor_group_service_name> restart command. For example:

      service contrail-config-nodemgr restart
      

    To identify the service names inside a specific OpenContrail Supervisor group, use the supervisorctl -s unix:///tmp/supervisord_<group_name>.sock status command. For example:

    supervisorctl -s unix:///tmp/supervisord_database.sock status
    

    Example of system response:

    contrail-database                RUNNING    pid 1349, uptime 2 days, 21:12:33
    contrail-database-nodemgr        RUNNING    pid 1347, uptime 2 days, 21:12:33
    
    supervisorctl -s unix:///tmp/supervisord_config.sock status
    
    contrail-api:0                   RUNNING    pid 49848, uptime 2 days, 20:11:54
    contrail-config-nodemgr          RUNNING    pid 49845, uptime 2 days, 20:11:54
    contrail-device-manager          RUNNING    pid 49849, uptime 2 days, 20:11:54
    contrail-discovery:0             RUNNING    pid 49847, uptime 2 days, 20:11:54
    contrail-schema                  RUNNING    pid 49850, uptime 2 days, 20:11:54
    contrail-svc-monitor             RUNNING    pid 49851, uptime 2 days, 20:11:54
    ifmap                            RUNNING    pid 49846, uptime 2 days, 20:11:54
    
    supervisorctl -s unix:///tmp/supervisord_analytics.sock status
    
    contrail-analytics-api           RUNNING    pid 1346, uptime 2 days, 21:13:17
    contrail-analytics-nodemgr       RUNNING    pid 1340, uptime 2 days, 21:13:17
    contrail-collector               RUNNING    pid 1344, uptime 2 days, 21:13:17
    contrail-query-engine            RUNNING    pid 1345, uptime 2 days, 21:13:17
    contrail-snmp-collector          RUNNING    pid 1341, uptime 2 days, 21:13:17
    contrail-topology                RUNNING    pid 1343, uptime 2 days, 21:13:17
    
    supervisorctl -s unix:///tmp/supervisord_control.sock status
    
    contrail-control                 RUNNING    pid 1330, uptime 2 days, 21:13:29
    contrail-control-nodemgr         RUNNING    pid 1328, uptime 2 days, 21:13:29
    contrail-dns                     RUNNING    pid 1331, uptime 2 days, 21:13:29
    contrail-named                   RUNNING    pid 1333, uptime 2 days, 21:13:29
    
    supervisorctl -s unix:///tmp/supervisord_webui.sock status
    
    contrail-webui                   RUNNING    pid 1339, uptime 2 days, 21:13:44
    contrail-webui-middleware        RUNNING    pid 1342, uptime 2 days, 21:13:44
    

Access the OpenContrail web UI

Your OpenContrail cluster may not use SSL at all if no certificate authority is available. By default, OpenContrail uses SSL and requires certificate authentication. If you attempt to access the OpenContrail web UI through the proxy with such a configuration, the UI accepts your credentials but logs you out immediately. As a workaround, you can access the OpenContrail web UI management VIP directly over HTTP, bypassing the proxy.
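
For example, assuming the default OpenContrail web UI HTTP port (typically 8080), the direct URL looks similar to the following; the exact VIP and port are environment-specific:

http://<opencontrail_management_VIP>:8080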

To access the OpenContrail web UI:

  1. Obtain the Administrator credentials. Select from the following options depending on your cluster type:

    • For OpenStack:

      1. Log in to the Salt Master node.

      2. Run the following command:

        salt 'ctl01*' cmd.run 'cat /root/keystonerc'
        
      3. From the output of the command above, record the values of OS_USERNAME and OS_PASSWORD.

    • For Kubernetes:

      1. Log in to any OpenContrail controller node.

      2. Run the following command:

        grep "auth.admin" /etc/contrail/contrail-webui-userauth.js
        
      3. From the output of the command above, record the values of auth.admin_user and auth.admin_password.

  2. In a browser, type either the OpenStack controller node VIP or the Kubernetes controller node VIP on port 8143. For example, https://172.31.110.30:8143.

  3. On the page that opens, configure your web browser to trust the certificate if you have not done so yet:

    • In Google Chrome or Chromium, click Advanced > Proceed to <URL> (unsafe).
    • In Mozilla Firefox, navigate to Advanced > Add Exception, enter the URL in the Location field, and click Confirm Security Exception.

    Note

    For other web browsers, the steps may vary slightly.

  4. Enter the Administrator credentials obtained in step 1. Leave the Domain field empty unless the default configuration was customized.

  5. Click Sign in.

Configure route targets for external access

Configuring the OpenContrail route targets for your Juniper MX routers allows you to extend the private network outside the MCP cloud.

To configure route targets for external access:

  1. Log in to the OpenContrail web UI as described in Access the OpenContrail web UI.
  2. Navigate to Configure > Networking > Networks.
  3. Click the gear icon of the network that you choose to be external and select Edit.
  4. In the Edit window:
    1. Expand Advanced Options.
    2. Select the Shared and External check boxes.
    3. Expand Route Target(s).
    4. Click the + symbol to add ASN and Target.
    5. Enter the corresponding numbers set during provisioning of the Juniper MX router that is used in your MCP cluster.
    6. Click Save.
  5. Verify the route targets configuration:
    1. Navigate to Configure > Infrastructure > BGP Routers.
    2. Expand the menu of one of the BGP Router or Control Node nodes.
    3. Verify that the Autonomous System value matches the ASN set in the previous steps.
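
For reference, the ASN and Target values typically mirror the VRF configuration on the Juniper MX side. A hypothetical Junos example with illustrative values (private ASN 64512 and target 10000):

set routing-instances <vrf_name> vrf-target target:64512:10000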

Enable Long Lived Graceful Restart in OpenContrail

Warning

Enabling LLGR causes a restart of the Border Gateway Protocol (BGP) peerings.

Long Lived Graceful Restart (LLGR) must be enabled on both sides of the peering: on the edge gateways and on the OpenContrail control plane.

To enable LLGR:

  1. Log in to the MX Series router CLI.

  2. Add the following lines to the router configuration file:

    set protocols bgp group <name> family inet-vpn unicast graceful-restart long-lived restarter stale-time 20
    set protocols bgp group <name> graceful-restart restart-time 1800
    set protocols bgp group <name> graceful-restart stale-routes-time 1800
    
  3. Commit the configuration changes to the router.

  4. Open your Git MCP project repository.

  5. Add the following lines to cluster/<name>/opencontrail/control.yml:

    classes:
    ...
    - system.opencontrail.client.resource.llgr
    ...
    
  6. Commit and push the changes to the project Git repository.

  7. Log in to the Salt Master node.

  8. Pull the latest changes of the cluster model and the system model that has the system.opencontrail.client.resource.llgr class defined.

  9. Update the salt-formula-opencontrail package.

  10. Apply the opencontrail state:

    salt -C 'I@opencontrail:config and *01*' state.sls opencontrail.client
    

Use the OpenContrail API client

The contrail-api-cli command-line utility interacts with the OpenContrail API server. It allows you to search for or modify API resources and supports unix-style commands. For more information, see the official contrail-api-cli documentation.

This section contains the following topics:

Install the OpenContrail API client

To install contrail-api-cli:

  1. Log in to any OpenContrail controller node. For example, ntw01.

  2. Install the Python virtual environment for contrail-api-cli:

    apt-get install python-pip python-dev -y &&\
    pip install virtualenv && \
    virtualenv contrail-api-cli-venv && \
    source contrail-api-cli-venv/bin/activate && \
    git clone https://github.com/eonpatapon/contrail-api-cli/ && \
    cd contrail-api-cli;sudo python setup.py install
    

Access the OpenContrail API client

To access the OpenContrail API:

  1. Source the keystonerc files that contain the credentials and endpoints:

    source /root/keystonerc
    source /root/keystonercv3
    
  2. Connect to the OpenContrail API using the following command:

    contrail-api-cli --host 10.167.4.20 --port 9100 shell
    

    Or you can use your OpenStack credentials. For example:

    contrail-api-cli --os-user-name admin --os-password workshop  \
    --os-auth-plugin v2password --host 10.10.10.254 --port 8082 --protocol http \
    --insecure --os-auth-url http://10.10.10.254:5000/v2.0 --os-tenant-name admin shell
    

Note

  • MCP uses port 9100 by default, whereas the standard OpenContrail API port is 8082.
  • For the ln command, define a schema version using the --schema-version parameter, for example, --schema-version 3.1. The known versions are the following: 1.10, 2.21, 3.0, 3.1, 3.2.
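
Once connected, you can inspect resources with the unix-style commands supported by the shell. A minimal illustrative session, assuming default resource types (the UUID is a placeholder):

ls virtual-network
cat virtual-network/<uuid>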

The contrail-api-cli-extra package

The contrail-api-cli-extra package contains the contrail-api-cli commands to make the OpenContrail installation and operation process easier.

The commands are grouped in different sub-packages and have different purposes:

  • clean: detect and remove bad resources
  • fix: detect and fix bad resources
  • migration: handle data migration when upgrading OpenContrail to a major version
  • misc: general-purpose commands
  • provision: provision and configure an OpenContrail installation

To install the contrail-api-cli-extra package:

Run the following command:

pip install contrail-api-cli-extra

The most used contrail-api-cli-extra sub-packages are the following:

  • contrail_api_cli_extra.clean - since this sub-package allows removing resources, you must explicitly load the contrail_api_cli.clean namespace to run the commands of this sub-package.

    Example of usage:

    contrail-api-cli --ns contrail_api_cli.clean <command>
    # usage
    contrail-api-cli --host 10.167.4.21 --port 9100 --ns contrail_api_cli.clean shell
    

    This package includes the clean-<type> command. Replace type with the required type of cleaning process. For example:

    • clean-route-target
    • clean-orphaned-acl
    • clean-si-scheduling
    • clean-stale-si
  • contrail_api_cli_extra.fix - allows you to verify and fix misconfigured resources. For example, fix multiple security groups or association of a subnet with a virtual network (VN) in a key-value store.

    If this sub-package is installed, it launches with contrail-api-cli automatically.

    Example of usage:

    fix-vn-id virtual-network/600ad108-fdce-4056-af27-f07f9faa5cae --zk-server 10.167.4.21
    
    fix-zk-ip --dry-run --zk-server 10.167.4.21:2181 virtual-network/xxxxxx-xxxxxxx-xxxxxx
    

Define aging time for flow records

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To prevent high memory consumption by vRouter on highly loaded clusters, you can define the required aging time for flow records. Flows are aged out after being inactive for a specific period of time. By default, the timeout value is 180 seconds. You can configure the timeout depending on your cluster load needs using the flow_cache_timeout parameter of the contrail-vrouter-agent service.

To configure flow_cache_timeout:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/opencontrail/compute.yml of your Reclass model, define the required value in seconds for the flow_cache_timeout parameter:

    parameters:
      opencontrail:
        ...
        compute:
          ...
          flow_cache_timeout: 180
          ...
    
  3. Apply the changes:

    salt -C 'I@opencontrail:compute' state.apply opencontrail.compute
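
Optionally, verify that the new value has been rendered into the vRouter agent configuration. A minimal check, not a required step, using the configuration file path referenced elsewhere in this guide:

salt -C 'I@opencontrail:compute' cmd.run 'grep flow_cache_timeout /etc/contrail/contrail-vrouter-agent.conf'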
    

OpenContrail 4.x-specific operations

This section contains the OpenContrail operations applicable to version 4.x.

Modify the OpenContrail configuration

The OpenContrail v4.x configuration files are generated by SaltStack and then mounted into containers as volumes. SaltStack also provides a mechanism to apply configuration changes by restarting the systemd services inside a container.

To modify the OpenContrail v4.x configurations:

  1. Log in to the Salt Master node.

  2. Make necessary configuration changes in the classes/cluster/<cluster_name>/opencontrail folder.

  3. Apply the opencontrail state specific to the OpenContrail configuration file where the changes were made. For example, if you made changes in the database.yaml file:

    salt -C 'I@opencontrail:database' state.apply opencontrail
    

    After the state is applied, the systemd services are automatically restarted inside the OpenContrail container in question to apply the configuration changes.

The doctrail utility for the OpenContrail containers in OpenStack

In an OpenStack-based MCP environment, the OpenContrail installation includes the doctrail utility that provides easy access to the OpenContrail containers. Its default installation path is /usr/bin/doctrail.

The doctrail usage is as follows:

doctrail {analytics|analyticsdb|controller|all} {<command_to_send>|console}

The acceptable destinations used by doctrail are as follows:

  • controller for the OpenContrail controller container (console, commands)
  • analytics for the OpenContrail analytics container (console, commands)
  • analyticsdb for the OpenContrail analytics database container (console, commands)
  • all for all containers on the host (commands only)

Examples of the doctrail commands:

# Show contrail-status on all containers on this host
doctrail all contrail-status

# Show contrail-status on controller container
doctrail controller contrail-status

# Restart contrail-database on controller container
doctrail controller service contrail-database restart

# Connect to the console of controller container
doctrail controller console

# Connect to the console of analytics container
doctrail analytics console

Set multiple contrail-api workers

In the MCP Build ID 2019.2.0, by default, one worker of the contrail-api service is used. Starting from the MCP 2019.2.3 maintenance update, six workers are used by default.

If needed, you can change the default configuration using the instruction below. This section also describes how to stop, start, or restart multiple workers.

To set multiple contrail-api workers for the MCP version 2019.2.3 or later:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster name>/opencontrail/control.yml, set the required amount of workers:

    parameters:
      _param:
        opencontrail_api_workers_count: <required_amount_of_workers>
    
  3. Log in to the Salt Master node.

  4. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    salt-call state.sls reclass.storage
    
  5. Apply the Reclass model changes:

    salt -C 'I@opencontrail:control' state.apply opencontrail
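
Optionally, verify that the expected number of contrail-api workers is running, for example, using the doctrail utility described later in this guide (a sketch, not a required step):

salt -C 'I@opencontrail:control' cmd.run 'doctrail controller contrail-status'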
    

To set multiple contrail-api workers for the Build ID 2019.2.0:

Caution

With the configuration below, you can start creating network entities in a newly created OpenStack project only one minute after the project is created.

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster name>/opencontrail/control.yml, set the required amount of workers:

    parameters:
      _param:
        opencontrail_api_workers_count: <required_amount_of_workers>
    
  3. Log in to the Salt Master node.

  4. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    salt-call state.sls reclass.storage
    
  5. Apply the Reclass model changes:

    salt -C 'I@opencontrail:control' state.apply opencontrail
    
  6. Log in to any ntw node.

  7. In /etc/contrail/contrail-api.conf, change the following parameter:

    [KEYSTONE]
    keystone_sync_on_demand=true
    

    to

    [KEYSTONE]
    keystone_sync_on_demand=false
    
  8. Restart the OpenContrail controller Docker container:

    cd /etc/docker/compose/opencontrail/; docker-compose down; docker-compose up -d
    
  9. Wait until all OpenContrail controller services are up and running. To verify the OpenContrail services status, refer to Verify the OpenContrail status.

  10. Repeat the steps 7-9 on the remaining ntw nodes.

To stop, start, or restart multiple workers:

Caution

We recommend that you do not stop, start, or restart the contrail-api workers using the service command as it may cause unstable worker behavior, such as an incorrect number of running workers and race conditions.

  • To stop all contrail-api workers on the target node:

    systemctl stop contrail-api@*
    
  • To start all contrail-api workers on the target node:

    systemctl start contrail-api@*
    
  • To restart all contrail-api workers on the target node:

    systemctl restart contrail-api@*
    

Note

You may stop, start, or restart a certain worker by using the worker ID instead of the * character.
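
For example, to restart only the worker with ID 0 (an illustrative worker ID):

systemctl restart contrail-api@0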

Enable SSL for an OpenContrail API internal endpoint

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section describes how to manually enable SSL for an internal endpoint of the OpenContrail API.

To enable SSL for an OpenContrail API internal endpoint:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster_name>/opencontrail/init.yml, add the following parameters:

    parameters:
      _param:
    ...
        opencontrail_api_ssl_enabled: True
        opencontrail_api_protocol: https
    ...
    
  3. In cluster/<cluster_name>/opencontrail/analytics.yml, cluster/<cluster_name>/opencontrail/control.yml, and cluster/<cluster_name>/opencontrail/compute.yml, add the following class:

    classes:
    ...
      - system.salt.minion.cert.opencontrail.api
    ...
    
  4. Log in to the Salt Master node.

  5. Create certificates on the OpenContrail nodes:

    salt -C "I@opencontrail:database or I@opencontrail:compute" state.sls salt.minion.cert
    
  6. Configure the HAProxy services:

    salt -C "I@opencontrail:control" state.apply haproxy
    
  7. Configure the OpenContrail services:

    salt -C "I@opencontrail:database" state.apply opencontrail exclude=opencontrail.compute
    salt -C "I@opencontrail:compute" state.apply opencontrail.client
    
  8. Configure the nodes that run neutron-server:

    salt -C "I@neutron:server" state.apply neutron.server
    
  9. Restart the neutron-server service to use SSL to connect to contrail-api VIP:

    salt -C "I@neutron:server" service.restart neutron-server
    

Enable or disable VNC API statistics

Note

This feature is available starting from the MCP 2019.2.12 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

On every HTTP resource request, such as create, update, list, and so on, the OpenContrail contrail-api server sends statistics messages to the collector service, which can potentially cause a high load on the analytics nodes. To disable such statistics messages, use the disable_vnc_api_stats option of the contrail-api service. By default, disable_vnc_api_stats is set to True.

To enable or disable VNC API statistics:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/opencontrail/control.yml of your Reclass model, specify the required value for the disable_vnc_api_stats parameter. To enable sending statistics messages to the OpenContrail analytics nodes, set disable_vnc_api_stats to False. To suppress the statistics messages, set disable_vnc_api_stats to True. For example:

    parameters:
      opencontrail:
        ...
        config:
          ...
          disable_vnc_api_stats: True
          ...
    
  3. Apply the changes:

    salt -C 'I@opencontrail:control' state.sls_id /etc/contrail/contrail-api.conf opencontrail
    
  4. Restart the contrail-api service:

    Warning

    Restarting the contrail-api service on all nodes at once causes interruption for CRUD operations with networking objects. Plan the restart in advance or restart contrail-api on each node one by one using a modified target in the following Salt command.

    • If one worker is configured for the OpenContrail API:

      salt -C 'I@opencontrail:control' cmd.run "doctrail controller service contrail-api* restart"
      
    • If multiple workers are configured for the OpenContrail API:

      salt -C 'I@opencontrail:control' cmd.run "doctrail controller systemctl restart contrail-api@*"
      

Enable or disable VMI and VN statistics collection

Note

This feature is available starting from the MCP 2019.2.12 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can enable or disable the gathering of statistics related to virtual machine interfaces (VMIs) and virtual networks (VNs) by the OpenContrail vRouter agent service and sending them to the OpenContrail analytics nodes. Such statistics messages can cause a heavy load on the OpenContrail analytics nodes in large deployments with a huge amount of workloads. By default, disable_stats_collection is set to True.

To enable or disable VMI and VN statistics collection:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/opencontrail/compute.yml of your Reclass model, specify the required value for the disable_stats_collection parameter. For example:

    parameters:
      opencontrail:
        ...
        compute:
          ...
          disable_stats_collection: True
          ...
    
  3. Apply the changes:

    salt -C 'I@opencontrail:compute' state.sls_id /etc/contrail/contrail-vrouter-agent.conf opencontrail
    
  4. Restart the contrail-vrouter-agent service:

    Warning

    Restarting the contrail-vrouter-agent service causes networking service interruption for workloads on the affected node(s).

    salt -C 'I@opencontrail:compute' service.restart contrail-vrouter-agent
    

DevOps Portal

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

MCP’s Operations Support System (OSS), known as StackLight, now includes the DevOps Portal connected to the OSS services. The DevOps Portal significantly reduces the complexity of Day-2 cloud operations through services and dashboards with a high degree of automation, availability statistics, resource utilization, capacity utilization, continuous testing, logs, metrics, and notifications. The DevOps Portal enables cloud operators to manage larger clouds with greater uptime without requiring large teams of experienced engineers and developers.

This solution builds on MCP operations-centric vision of delivering a cloud environment through a CI/CD pipeline with continuous monitoring and visibility into the platform.

The portal collects a comprehensive set of data about the cloud, offers visualization dashboards, and enables the cloud operator to interact with a variety of tools. More specifically, the DevOps Portal includes the following dashboards:

Push Notification

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Push Notification service enables users to send notifications, execute API calls, and open tickets on helpdesk. The service can also connect several systems through a notification queue that transforms and routes API calls from source systems into specific protocols that other systems use to communicate. With the notification system, you can send an API call and transform it into another API call, an email, an AMQP message, or another protocol.

Note

The Push Notification service depends on the following services:

  • DevOps Portal web UI
  • Elasticsearch cluster (version 5.x.x or higher)
  • PostgreSQL database

To view and search for generated notifications, navigate to the Notifications dashboard available in the DevOps Portal UI.

Cloud Health

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Health dashboard is the UI for the Cloud Health Service.

The Cloud Health service collects availability results for all cloud services and failed customer (tenant) interactions (FCI) for a subset of those services. These metrics are displayed so that operators can see both point-in-time health status and trends over time.

Note

The Cloud Health service depends on the following services:

  • DevOps Portal web UI
  • Grafana service of StackLight LMA

To view the metrics:

  1. Log in to the DevOps Portal.
  2. Navigate to the Cloud health dashboard.
  3. View the metrics on tabs depending on your needs:
    • The Availability tab for the availability results for all cloud services
    • The FCI tab for FCI for a subset of cloud services

Cloud Intelligence

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Intelligence service collects and stores data from MCP services, including OpenStack, Kubernetes, bare metal, and others. The data can be queried to enable use cases such as cost visibility, business insights, cost comparison, chargeback/showback, cloud efficiency optimization, and IT benchmarking. Operators can interact with the resource data using a wide range of queries, for example, searching for the last VM rebooted, total memory consumed by the cloud, number of containers that are operational, and so on.

Note

The Cloud Intelligence service depends on the following services:

  • DevOps Portal web UI
  • Elasticsearch cluster (version 5.x.x or higher)
  • Runbook Automation

To start creating queries that will be submitted to the search engine and display the list of resources:

  1. Log in to the DevOps Portal.

  2. Navigate to the Cloud Intelligence dashboard.

  3. Create your query using the cloud intelligence search query syntax. A combined query example is provided after this procedure:

    • To search by groups, use:

      • type=vm for instances
      • type=image for images
      • type=flavor for flavors
      • type=host for hosts
      • type=availabilityZone for availability zones
      • type=network for networks
      • type=volume for volumes
      • type=stack for Heat stacks
      • type=tenant for tenants
      • type=user for users
    • To search by field names, specify the field name with the value it contains. For example:

      Note

      The search by query string is case-insensitive.

      • status=active displays all resources with Active in the Status field, meaning they are in active status
      • status=(active saving) displays all resources in Active or Saving statuses
      • name="test_name" displays all resources which Name fields contain the exact phrase test_name
    • To group a number of queries in a single one, use the following boolean search operators:

      • | - the pipe symbol stands for the OR operation. Usage example: minDisk=0 | minRam=0
      • + - the plus symbol stands for the AND operation. Usage example: minDisk=0 + minRam=0
      • - - the minus symbol negates a single token. Usage example: minRam=0 + -minDisk=40 searches for resources with minRam equal to 0 and minDisk not equal to 40 at the same time
      • ( ) - parentheses signify grouping and precedence. Usage example: (minDisk=0 minRam=0) + minDisk=40
    • To search for the reserved characters, escape them with \. The whole list of these characters includes + - = & | > < ! ( ) { } [ ] ^ " ~ * ? : \ /.

  4. View search results:

    • An item name and most important properties are visible by default.
    • To view the full item properties list, click on the item block.

    Note

    If you have the Cleanup service enabled, the Create Janitor rule button is available for the groups that Janitor supports like VMs, Images, and Tenants. The button provides the same functionality as the Create new rule button on the Janitor dashboard with the conditions list prefilled with item-specific properties.

  5. Export search results into JSON, YAML, and CSV formats using the corresponding buttons on the Cloud Intelligence dashboard. The exported data contains the original query and the resulting groups with their items.
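
The following is a hypothetical combined query built from the operators described above. It returns all active instances except the resources whose Name fields contain test_name:

type=vm + status=active + -name="test_name"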

Cloud Vector

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Vector dashboard uses a node graph to represent a cloud environment in a form of a cloud map. The entities that build the map include availability zones (AZs), hosts, VMs, and services. Each host represents a compute node in a particular AZ with all VMs and services running on it. Thereby, a cloud map enables you to easily identify the number of nodes running in your cloud environment.

The screen capture below is an example of a cloud map created by Cloud Vector.

_images/devops-portal-cloud-vector.png

Note

The Cloud Vector dashboard depends on the following services:

  • DevOps Portal web UI
  • Cloud Intelligence service

To use the Cloud Vector dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the Cloud Vector dashboard.

  3. Proceed with the following available actions as required:

    • Collapse child elements:

      Note

      Hosts with more than 50 child VMs are collapsed by default.

      Note

      The size of a host circle depends on the number of its child elements. The more VMs a host owns, the bigger it is.

      • Double-click on an AZ or a host to collapse its child elements. If a host is collapsed, the number of its VMs is displayed. Services are not collapsed when you collapse a host.
      • Use the slider to collapse the nodes whose VM count matches the specified conditions.
    • Expand child elements:

      • Double-click on a collapsed element to expand its child elements.
      • Click Expand all to expand all collapsed elements.
    • Drag elements on the canvas:

      • Drag a particular element to move it and all connected elements.
      • Drag the canvas background to change the position of all elements.
      • Click Reset zooming to reset canvas shifts.
    • Scale elements on the canvas:

      Note

      Red borders appear if elements are extended beyond the canvas boundaries.

      • Click on the canvas and scroll up or down to zoom in or out.
      • Click Reset zooming to reset scaling.
    • Show and hide node labels:

      • Use toggles to show or hide labels of particular entities.
      • Hover over a particular element to view its label.

Runbooks

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Runbooks Automation service enables operators to create a workflow of jobs that get executed at specific time intervals or in response to specific events. For example, operators can automate periodic backups, weekly report creations, specific actions in response to failed Cinder volumes, and so on.

Note

The Runbooks Automation service is not a lifecycle management tool and is not appropriate for reconfiguring, scaling, or updating MCP itself, as these operations are performed exclusively with DriveTrain.

Note

The Runbooks Automation service depends on the following services:

  • DevOps Portal web UI
  • PostgreSQL database

Using the Runbooks dashboard, operators can call preconfigured jobs or jobs workflows and track the execution status through the web UI.

Before you proceed with the dashboard, you may need to configure your own jobs or reconfigure the existing ones to adjust them to the specific needs of your installation.

Configure Rundeck jobs

Rundeck enables you to easily add jobs to the Runbook Automation service as Rundeck jobs and chain them into workflows. Once the jobs are created and added to your Reclass model, you execute them through the DevOps Portal UI.

Create users

To configure Users in the Rundeck service:

  1. Use the following structure to configure users through pillar parameters:

    parameters:
      rundeck:
        server:
          users:
            user1:
              name: USER_NAME_1
              password: USER_PWD_1
              roles:
                - user
                - admin
                - architect
                - deploy
                - build
            user2:
              name: USER_NAME_2
              password:  USER_PWD_2
              roles:
                - user
                - deploy
                - build
            ...
    

    Note

    Currently, the default access control list (ACL) properly supports only admin users. Therefore, the user2 user in the configuration structure above will not be able to run jobs and view projects.

  2. Create API tokens for non-interactive communications with the Rundeck API. To configure tokens, specify the required parameters in the metadata. For example:

    parameters:
      rundeck:
        server:
          tokens:
            admin: token0
            User2: token4
    
  3. Apply the rundeck.server state:

    salt-call state.sls rundeck.server
    
  4. Restart the Rundeck service:

    docker service update --force rundeck_rundeck
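
Optionally, verify that the Rundeck service tasks have been recreated, for example:

docker service ps rundeck_rundeck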
    
Create projects

To create Projects in the Rundeck service:

  1. Use the following structure to configure projects through pillar parameters:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              description: PROJECT_DESCRIPTION
              node:
                node01:
                  nodename: NODE_NAME
                  hostname: HOST_NAME
                  username: USER_NAME
                  tags: TAGS
    

    For example:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              description: "Test project"
              node:
                node01:
                  nodename: node-1
                  hostname: 10.20.0.1
                  username: runbook
                  tags: [cicd, docker]
                node02:
                  nodename: node-2
                  hostname: 10.20.0.2
                  username: runbook
                  tags: [cicd, docker]
                node03:
                  nodename: node-3
                  hostname: 10.20.0.3
                  username: runbook
                  tags: [cicd, docker]
    

    All configured nodes in a particular project are available to run jobs and commands within this project. Also, nodes can be tagged, which allows for filtering when executing commands and jobs on nodes.

  2. The Rundeck metadata has a preconfigured user, called runbook, to access other nodes. You need to configure this user on the nodes before you can use jobs or commands in projects. To configure the runbook user, inherit the classes of your nodes from the following class and specify the rundeck_runbook_public_key and rundeck_runbook_private_key parameters:

    classes:
      - system.rundeck.client.runbook
    
  3. Apply the linux and openssh states:

    salt '*' state.sls linux.system.user
    salt '*' state.sls openssh.server
    
Configure jobs importing

You can configure a Git repository for each project and store Rundeck jobs in the repository. The following extract is an example of a Rundeck job that you can define in the Git repository for importing:

- description: Shows uptime of the system.
  executionEnabled: true
  group: systools
  loglevel: INFO
  name: uptime
  nodeFilterEditable: false
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: true
      rankOrder: ascending
      threadcount: 1
    filter: tags:cicd
  nodesSelectedByDefault: true
  options:
  - enforced: true
    name: pretty
    values:
    - -p
  scheduleEnabled: true
  sequence:
    commands:
    - exec: uptime ${option.pretty}
    keepgoing: false
    pluginConfig:
      WorkflowStrategy:
        node-first: null
    strategy: node-first

This approach has the following limitations:

  • Changes introduced using the git commit --amend command are not supported.
  • The name parameter in job definition files is required. The value of the name parameter may differ from the file name and determines the resulting name of a job.
  • You can configure no more than one remote repository per project.
  • An incorrect or non-existent branch definition may not result in a Salt configuration error but can lead to an empty job list.
  • The Salt state may not recover jobs if you have specified the branch incorrectly. For example, if the jobs are lost due to an incorrect branch definition, the synchronization of jobs may remain broken even if the correct branch is defined later and the Salt state is re-applied.

To configure job importing for a project:

  1. To use a remote repository as a source of jobs, extend the project’s metadata as required. A minimal configuration includes the address parameter for the import plugin:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              plugin:
                import:
                  address: https://github.com/path/to/repo.git
    

    A complete list of all available parameters for the import plugin includes:

    The import plugin parameters:

    • address (default: https://github.com/path/to/repo.git) - String. A valid Git URL. Required.
    • branch (default: master) - String. The name of a repository branch. Optional.
    • import_uuid_behavior (default: remove) - String, one of preserve, remove, or archive. The UUID importing mode in job descriptions. Optional.
    • format (default: yaml) - String, yaml or xml. The extension of files containing job definitions. Optional.
    • path_template (default: ${job.group}${job.name}.${config.format}) - String. The pattern to recognize job definition files. Optional.
    • file_pattern (default: '.*\.yaml') - Regex. The regex that filters jobs for importing. Optional.

    Example of the import plugin configuration with all the available parameters:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              plugin:
                import:
                  address: https://github.com/akscram/rundeck-jobs.git
                  branch: master
                  import_uuid_behavior: remove
                  format: yaml
                  path_template: ${job.group}${job.name}.${config.format}
                  file_pattern: '.*\.yaml'
    
  2. Apply the Rundeck client state:

    salt-call state.sls rundeck.client
    
Configure iFrame forwarding

By default, the Rundeck service configuration does not enable you to get access through an external proxy address and the exposed Rundeck port, which is 14440 by default. However, you can forward the Runbooks dashboard through a proxy endpoint if you use the DevOps Portal through external proxy networks.

To configure iFrame forwarding:

  1. Configure iFrame forwarding on the cluster level by specifying the following parameters in the oss/client.yml:

    rundeck_forward_iframe: True
    rundeck_iframe_host: <external-proxy-endpoint>
    rundeck_iframe_port: <external-proxy-port>
    rundeck_iframe_ssl: False
    
  2. Apply the updated rundeck.server formula:

    salt -C 'I@rundeck:server' state.sls rundeck.server
    
  3. Verify that there are no cached modules, grains, and so on, and that the minion configuration is updated:

    salt '*' saltutil.clear_cache
    salt -C 'I@docker:swarm:role:master' state.sls salt
    
  4. Refresh and update your deployment:

    salt '*' saltutil.refresh_beacons
    salt '*' saltutil.refresh_grains
    salt '*' saltutil.refresh_modules
    salt '*' saltutil.refresh_pillar
    salt '*' saltutil.sync_all
    
  5. Recreate the Rundeck stack:

    docker stack rm rundeck
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    salt -C 'I@rundeck:client' state.sls rundeck.client
    
  6. Specify a custom endpoint for the DevOps portal on the cluster level of the Reclass model in the oss/client.yml file:

    devops_portal:
      config:
        service:
          rundeck:
            endpoint:
              address: ${_param:rundeck_iframe_host}
              port: ${_param:rundeck_iframe_port}
              https: ${_param:rundeck_iframe_ssl}
    
  7. Recreate the DevOps portal stack:

    docker stack rm devops-portal
    salt -C 'I@devops_portal:config' state.sls devops_portal.config
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    

Now, you can add an additional configuration for proxying the defined address and apply it on the proxy nodes.

Configure an external datasource

You can enable the Runbooks automation service to use an external datasource through the Salt metadata. This section explains how to configure the service to use the PostgreSQL database as an external source for the datastore.

To enable the PostgreSQL database support:

  1. Define the following parameters on the cluster level of your Reclass model in the oss/client.yml file:

    parameters:
      _param:
        rundeck_postgresql_username: rundeck
        rundeck_postgresql_password: password
        rundeck_postgresql_database: rundeck
        rundeck_postgresql_host: ${_param:control_vip_address}
        rundeck_postgresql_port: 5432
      rundeck:
        server:
          datasource:
            engine: postgresql
            host: ${_param:rundeck_postgresql_host}
            port: ${_param:rundeck_postgresql_port}
            username: ${_param:rundeck_postgresql_username}
            password: ${_param:rundeck_postgresql_password}
            database: ${_param:rundeck_postgresql_database}
    
  2. Recreate Rundeck and PostgreSQL stacks:

    docker stack rm postgresql rundeck
    salt -C 'I@rundeck:server' state.sls rundeck.server
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    salt -C 'I@postgresql:client' state.sls postgresql.client
    salt -C 'I@rundeck:client' state.sls rundeck.client
    
  3. Verify that the Rundeck tables exist in PostgreSQL by logging in to PostgreSQL as the rundeck user from the monitoring node where the OSS services are running and checking the output for the base_report table. For example:

    psql -h <postgresql_ip> -U rundeck -W -d rundeck
    rundeck=> \d
                  List of relations
     Schema |            Name            |   Type   |  Owner
    --------+----------------------------+----------+---------
     public | auth_token                 | table    | rundeck
     public | base_report                | table    | rundeck
     public | execution                  | table    | rundeck
     public | hibernate_sequence         | sequence | rundeck
     public | log_file_storage_request   | table    | rundeck
     public | node_filter                | table    | rundeck
     public | notification               | table    | rundeck
     public | orchestrator               | table    | rundeck
     public | plugin_meta                | table    | rundeck
     public | project                    | table    | rundeck
     public | rdoption                   | table    | rundeck
     public | rdoption_values            | table    | rundeck
     public | rduser                     | table    | rundeck
     public | report_filter              | table    | rundeck
     public | scheduled_execution        | table    | rundeck
     public | scheduled_execution_filter | table    | rundeck
     public | storage                    | table    | rundeck
     public | workflow                   | table    | rundeck
     public | workflow_step              | table    | rundeck
     public | workflow_workflow_step     | table    | rundeck
    (20 rows)
    
    rundeck=> select * from base_report;
    

Run preconfigured jobs from web UI

The Rundeck jobs and workflows run automatically depending on the configuration. However, the DevOps Portal also enables you to run the preconfigured Rundeck jobs and workflows using the web UI and track the progress of their execution.

To run a Rundeck job:

  1. Log in to the DevOps Portal.
  2. Navigate to the Runbooks dashboard.
  3. Select the project you are interested in.
  4. Navigate to the Jobs tab in the top navigation bar. The jobs page will display all jobs you are authorized to view.
  5. If the jobs were defined inside groups, they appear as a listing grouped into a folder. To reveal the folder content, click the folder icon.
  6. Navigate to the required job and click it. The job details page opens. This page contains the configuration parameters for this job as well as statistics, activity, and definition details for it.
  7. To run the job, click Run job now.
  8. Once you have started the job execution, follow the job’s output in the Execution follow page.

See also

DriveTrain

Security

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Security dashboard is the UI for the Security Audit service, also known as Security Monkey. The service runs tests to track and evaluate security-related tenant changes and configurations. Using the Security dashboard, you can search through these audit findings and review them.

Note

The Security Audit service depends on the following services:

  • DevOps Portal web UI
  • Push Notification service
  • PostgreSQL database

To review security item details:

  1. Log in to the DevOps Portal.
  2. To quickly access current security items with unjustified issues, click the Security issues widget on the Dashboard tab. The Security items page opens.
  3. Click the name of the required security item in the Items section to view its details. The item details page opens.
  4. To review the configuration change that raised the security item, use the Revisions section. The affected parts of the configuration are color-coded:
    • Green stands for additions
    • Red stands for deletions
  5. To justify the unjustified issues, use the Issues section:
    1. Check the unjustified issue or issues.
    2. Edit the Justification field specifying the reason of justification. For example, This has been approved by the Security team.
    3. Click Justify.
  6. To attach a comment containing any required content to an item:
    1. In the Item comments section, paste the comment to the comment field.
    2. Click Add comment.
  7. To search for specific issues, use the Issues section. Each issue has a link to a page of the corresponding item containing the details of the issue.

Janitor

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Janitor dashboard is the UI for the Cleanup service, also known as Janitor Monkey.

The Cleanup service is a tool that enables you to automatically detect and clean up unused resources in your MCP deployment that may include:

  • OpenStack resources: virtual machines
  • AWS resources: instances, EBS volumes, EBS snapshots, auto scaling groups, launch configurations, S3 bucket, security groups, and Amazon Machine Images (AMI)

The architecture of the service allows for easy configuration of the scanning schedule as well as a number of other operations. This section explains how to configure the Cleanup service to fit your business requirements as well as how to use the Janitor dashboard in the DevOps Portal.

Note

The Cleanup service depends on the following services:

  • DevOps Portal web UI
  • Push Notification service
  • Cloud Intelligence service

Overview of the resource termination workflow

The resource termination workflow includes the following stages:

  1. Determining and marking the clean-up candidate.

    The Cleanup service applies a set of pre-configured rules to the resources available in your cluster on a regular basis. If any of the rules holds, the resource becomes a clean-up candidate and is marked accordingly.

  2. Deleting the resource.

    The Cleanup service deletes the resource if the resource is marked as a clean-up candidate and the scheduled termination time passes. The resource owner can manually delete the resource to release it earlier.

Configure the scanning schedule

By default, the Janitor service scans your MCP cluster for unused resources once every hour from 11:00 a.m. to 11:59 p.m. in the Pacific Time Zone (PT) on weekdays. However, you can override the default schedule by defining the simianarmy.properties environment variables depending on the needs of your environment.

The Janitor service schedule parameters:

  • simianarmy.scheduler.frequency (default: 1) - Works together with the frequencyUnit parameter to determine the scanning cycle. The frequency of 1 together with the HOURS frequencyUnit means that the scanning is performed once every hour.
  • simianarmy.scheduler.frequencyUnit (default: HOURS) - The available values include the java.util.concurrent.TimeUnit enum values.
  • simianarmy.calendar.openHour (default: 11) - Sets the time when the service starts performing any action (scheduling, deleting) on weekdays.
  • simianarmy.calendar.closeHour (default: 11) - Sets the time when the service stops performing any action (scheduling, deleting) on weekdays.
  • simianarmy.calendar.timezone (default: America/Los_Angeles) - The time zone in which the Janitor service operates.

To configure the scanning schedule:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/${_param:cluster_name}/oss/client.yml file of the Reclass model, define the Cleanup service schedule parameters as required. For example:

    docker:
      client:
        stack:
          janitor_monkey:
            environment:
              simianarmy.scheduler.frequency: 3
              simianarmy.scheduler.frequencyUnit: MINUTES
    
  3. To apply the changes, recreate the stack:

    salt -C 'I@docker:swarm:role:master' cmd.run 'docker stack rm janitor_monkey'
    salt '*' saltutil.refresh_pillar
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    

Clean up resources using web UI

The unused resources are cleaned up automatically according to the termination schedule. If you need to release an unused item earlier, you can terminate it manually using the DevOps Portal web UI.

To clean up resources manually:

  1. Log in to the DevOps Portal web UI.
  2. Navigate to the Janitor > Items tab.
  3. Check the items you need to clean up immediately from the list of resources scheduled for termination.
  4. Click Terminate.

Hardware Correlation

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Hardware (HW) correlation dashboard provides the capability to generate an on-demand dashboard that graphically illustrates the consumption of the physical host resources by VMs running on this host. For example, you can search for a VM and get a dashboard with the CPU, memory, and disk consumption for the compute node where this specific VM is running.

Note

The HW correlation dashboard depends on the following services:

  • DevOps Portal web UI
  • Cloud Intelligence service
  • Prometheus service of StackLight LMA

To use the HW correlation dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the HW correlation dashboard.

  3. To generate a dashboard for a compute host(s) on which a specific VM(s) is running:

    1. If required, filter the VMs list by tenants where they are running using the VM tenants filter.

    2. Select a VM(s) from the VMs drop-down list.

      Note

      By default, if the VM tenants filter is not used, all available VMs are present in the VMs drop-down list.

    3. If required, select resources from the Resources drop-down list.

    4. Click Search.

  4. Read the generated dashboard:

    • View the name of a compute host to which specified VMs belong and the list of selected VMs which are running on this host.

    • Examine the line graphs illustrating the resources consumption. To view the values with their measures, hover over a required line.

      Note

      The y-axis may contain suffixes, such as K, M, G, and others. These suffixes correspond to prefixes of the units of measurement, such as Kilo, Mega, Giga, and so on depending on the measure.

  5. To scale graphs by y-axis:

    • Click Zoom In to set y-axis start to the lowest point of selected lines and the y-axis end to the highest point of selected lines.
    • Click Zoom Out to switch to the default view where y-axis starts at 0 and ends at the highest point of all the lines on a chart.
  6. To scale graphs by x-axis:

    • Click Expand to expand a graph to the full width.
    • Click Collapse to switch to the default view.
  7. To select and hide lines on graphs:

    • Click on an item under a graph or on a line itself to view only the selected line. Combine this action with Zoom In for the detailed view.
    • Hover over an item under a graph to highlight the related line and mute the others.

DriveTrain

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The DriveTrain dashboard provides access to a custom Jenkins interface. Operators can perform the following operations through the DevOps Portal UI:

  • View the list of Jenkins jobs by views with job names and last build statuses and descriptions.
  • View specific Jenkins job information including a list of builds with their statuses and descriptions as well as the stages for the last five builds.
  • Analyze a specific build information including stages, console output, and artifact list on a build information page.
  • View a job console output in real time and manage the build flow.
  • Execute Jenkins jobs with custom parameters and re-run builds.

To perform all the above operations, use the DriveTrain dashboard available in the DevOps Portal UI.

Cloud Capacity Management

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Capacity Management service provides point in time resource consumption data for OpenStack by displaying parameters such as total CPU utilization, memory utilization, disk utilization, and number of hypervisors. The related dashboard is based on data collected by the Cloud Intelligence service and can be used for cloud capacity management and other business optimization aspects.

Note

The Cloud Capacity Management service depends on the following services:

  • DevOps Portal web UI
  • Kibana service of StackLight LMA

Heatmaps

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Heatmaps dashboard provides information about resource consumption by the cloud environment, identifying the load of each node and the number of alerts triggered for each node. The dashboard includes heatmaps for the following data:

  • Memory utilization
  • CPU utilization
  • Disk utilization
  • Alerts triggered

Note

The Heatmaps dashboard depends on the following services:

  • DevOps Portal web UI
  • Prometheus service of StackLight LMA

To use the Heatmaps dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the Heatmaps dashboard.

  3. Switch between the tabs to select a required heatmap.

    Each box on a heatmap represents a hypervisor. The box widget is color-coded:

    • Green represents a normal load or no alerts triggered
    • Orange represents a high load or low number of alerts
    • Red represents an overloaded node or a high number of alerts
  4. Specify the parameters for the data to be displayed:

    • Use Now, Last 5m, Last 15m, and Last 30m buttons to view data for a specific time period.

    • Use the Custom button to set a custom time period. The time value format includes a number from 1 to 99 and a metric prefix: m for minutes, h for hours, d for days, and w for weeks. For example, 12h, 3d, and so on.

    • Use the Max button to display the maximum value of resource consumption or the number of alerts during the selected period of time.

    • Use the Avg button on the Memory, CPU, and Disk tabs to display the average value of resource consumption during the selected period of time.

    • Use the Diff button on the Alerts tab to display the count of alerts triggered since the selected period of time.

      Note

      On the Alerts tab, the 0 count of alerts means that either 0 alerts are triggered or Prometheus failed to receive the requested data for a specific node.

LMA

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The LMA tab provides access to the LMA (Logging, Monitoring, and Alerting) toolchain of the Mirantis Cloud Platform. More specifically, LMA includes:

  • The LMA > Logging tab to access Kibana
  • The LMA > Monitoring tab to access Grafana
  • The LMA > Alerting tab to access the Prometheus web UI

Note

The LMA tab is only available in the DevOps Portal with LMA deployments and depends on the following services:

  • DevOps Portal web UI
  • Prometheus, Grafana, and Kibana services of StackLight LMA

LMA Logging dashboard

The LMA Logging tab provides access to Kibana. Kibana is used for log and time series analytics and provides real-time visualization of the data stored in Elasticsearch.

To access the Kibana dashboards from the DevOps Portal:

  1. Log in to the DevOps Portal.
  2. Navigate to the LMA > Logging tab.
  3. Use Kibana as described in Manage Kibana dashboards and Use Kibana filters and queries.

LMA Monitoring dashboard

The LMA Monitoring tab provides access to the Grafana web service that builds and visually represents metric graphs based on time series databases. A collection of predefined Grafana dashboards contains graphs on particular monitoring endpoints.

To access the Grafana dashboards from the DevOps Portal:

  1. Log in to the DevOps Portal.
  2. Navigate to the LMA > Monitoring tab.
  3. Log in to Grafana.
  4. Select the required dashboard from the Home drop-down menu.

For information about the available Grafana dashboards, see View Grafana dashboards. To hide nodes from dashboards, see Hide nodes from dashboards.

LMA Alerting dashboard

The LMA Alerting tab provides access to the Prometheus web UI that enables you to view simple graphs, Prometheus configuration and rules, and states of the monitoring endpoints of your deployment.

To access the Prometheus web UI from the DevOps Portal:

  1. Log in to the DevOps Portal.
  2. Navigate to the LMA > Alerting tab.
  3. Use the upper navigation menu to view alerts, graphs, or statuses. See View graphs and alerts and View Prometheus settings for details.

StackLight LMA operations

Using StackLight LMA, the Logging, Monitoring, and Alerting toolchain of the Mirantis Cloud Platform, cloud operators can monitor OpenStack environments, Kubernetes clusters, and OpenContrail services deployed on the platform and be quickly notified of critical conditions that may occur in the system so that they can prevent service downtimes.

This section describes how to configure and use StackLight LMA.

Configure StackLight LMA components

Once you deploy StackLight LMA, you may need to modify its components. For example, you may need to configure the Prometheus database, define alerting rules, and so on. The configuration of StackLight LMA is stored in Reclass. Therefore, you must modify the Reclass model and re-execute the Salt formulas.

Configure Telegraf

The configuration of the Telegraf agent is stored in the telegraf section of the Reclass model.

To configure Telegraf:

  1. Log in to the Salt Master node.

  2. Configure the telegraf section in the classes/cluster/cluster_name/init.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@linux:system' state.sls telegraf
    

Example configuration:

telegraf:
  agent:
    enabled: true
    interval: 15
    round_interval: false
    metric_batch_size: 1000
    metric_buffer_limit: 10000
    collection_jitter: 2
    output:
      prometheus_client:
        bind:
          address: 0.0.0.0
          port: 9126
        engine: prometheus

In the example above, the Reclass model is converted to a configuration file recognized by Telegraf. For details about options, see the Telegraf documentation and the */meta/telegraf.yml file in every Salt formula.

The input and output YAML dictionaries contain a list of defined inputs and outputs for Telegraf. To add input or output parameters to Telegraf, use the same format as used in */meta/telegraf.yml of the required Salt formula. However, this should be performed only by deployment engineers or developers.
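
For illustration only, the following snippet is a minimal sketch of adding an extra input on the cluster level. It reuses the procstat input structure that is also shown in Add a custom metric; the nginx process name is only a placeholder for the process you actually need to track:

telegraf:
  agent:
    input:
      procstat:
        process:
          # "nginx" is a placeholder process name; replace it as needed
          nginx:
            exe: nginx

After modifying the model, re-apply the telegraf state as described in the procedure above.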

Configure Prometheus

You may need to configure Prometheus, for example, to modify an existing alert. Prometheus configuration is stored in the prometheus:server section of the Reclass model.

To configure Prometheus:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required.

  3. Update the Salt mine:

    salt -C 'I@salt:minion' state.sls salt.minion.grains
    salt -C 'I@salt:minion' saltutil.refresh_modules
    salt -C 'I@salt:minion' mine.update
    
  4. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    

Example configuration:

prometheus:
  server:
    enabled: true
    bind:
      port: 9090
      address: 0.0.0.0
    storage:
      local:
        engine: "persisted"
        retention: "360h"
        memory_chunks: 1048576
        max_chunks_to_persist: 524288
        num_fingerprint_mutexes: 4096
    alertmanager:
      notification_queue_capacity: 10000
    config:
      global:
        scrape_interval: "15s"
        scrape_timeout: "15s"
        evaluation_interval: "1m"
        external_labels:
          region: 'region1'
    alert:
      PrometheusTargetDownKubernetesNodes:
        if: 'up{job="kubernetes-nodes"} != 1'
        labels:
          severity: down
          service: prometheus
        annotations:
          summary: 'Prometheus target down'

The following table describes the available settings.

Settings description
Setting Description
storage The storage YAML dictionary stores the configuration options for the Prometheus storage database. These options are passed to the Prometheus server through the command-line arguments.
config The config YAML dictionary contains the options that will be placed in the Prometheus configuration file. For more information, see Prometheus configuration documentation.
alert The alert YAML dictionary is used to generate Prometheus alerting rules. For more information, see Alerting rules. Alternatively, you can import alerts from the */meta/prometheus.yml file of any Salt formula. However, this should be performed only by deployment engineers or developers.

Caution

The Prometheus data directory is mounted from the Docker host. If you restart a container, it can be spawned on a different host, which can cause Prometheus to start with an empty storage. In such a case, the data remains available on the previous host.

See also

Manage alerts

Configure Prometheus long-term storage

You may need to configure Prometheus long-term storage to change the external labels, scrape intervals and timeouts, and so on. Since Prometheus long-term storage and Prometheus Relay are connected, you can use the same configuration file to modify Prometheus Relay, for example, to change the bind port. The configuration of Prometheus long-term storage and Prometheus Relay is stored in the prometheus:server and prometheus:relay sections of the Reclass model.

To configure Prometheus long-term storage and Prometheus Relay:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server and prometheus:relay sections in the classes/cluster/<cluster_name>/stacklight/telemetry.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@prometheus:relay' state.sls prometheus
    

Example configuration of Prometheus long-term storage:

prometheus:
  server:
    dir:
      config: /etc/prometheus
      data: /var/lib/prometheus/data
    bind:
      port: 9090
      address: 0.0.0.0
    storage:
      local:
        retention: 4320h
    config:
      global:
        scrape_interval: 30s
        scrape_timeout: 30s
        evaluation_interval: 15s
        external_labels:
          region: region1

Example configuration of Prometheus Relay:

prometheus:
  relay:
    enabled: true
    bind:
      port: 8080
    client:
      timeout: 12

Note

Configuring the timeout for Prometheus Relay is supported starting from the MCP 2019.2.4 maintenance update. To obtain the feature, follow the steps described in Apply maintenance updates.

Configure Alertmanager

The configuration of Alertmanager is stored in the prometheus:alertmanager section of the Reclass model. For available configuration settings, see the Alertmanager documentation.

To configure Alertmanager:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:alertmanager section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@docker:swarm:role:master and I@prometheus:server' state.sls prometheus.alertmanager
    

Example configuration:

prometheus:
  alertmanager:
    enabled: true
    bind:
      address: 0.0.0.0
      port: 9093
    config:
      global:
        resolve_timeout: 5m
      route:
        group_by: ['alertname', 'region', 'service']
        group_wait: 60s
        group_interval: 5m
        repeat_interval: 3h
        receiver: HTTP-notification
      inhibit_rules:
        - source_match:
            severity: 'down'
          target_match:
            severity: 'critical'
          equal: ['region', 'service']
        - source_match:
            severity: 'down'
          target_match:
            severity: 'warning'
          equal: ['region', 'service']
        - source_match:
            severity: 'critical'
          target_match:
            severity: 'warning'
          equal: ['alertname', 'region', 'service']
      receivers:
        - name: 'HTTP-notification'
          webhook_configs:
            - url: http://127.0.0.1
              send_resolved: true

Configure the logging system components

The logging system components include Fluentd (log collector), Elasticsearch, Elasticsearch Curator, and Kibana. You can modify the Reclass model to configure the logging system components. For example, you can configure Fluentd to gather logs from a custom entity.

Configure Fluentd

Fluentd gathers system and service logs and pushes them to the default output destination such as Elasticsearch, file, and so on. You can configure Fluentd to gather logs from custom entities, remove the default entities from the existing Fluentd configuration, as well as to filter and route logs. Additionally, you can configure Fluentd to expose metrics generated from logs to Prometheus.

Configure logs gathering

You can configure Fluentd to gather logs from custom entities, remove the default entities from the existing Fluentd configuration, as well as to filter and route logs. During configuration, you can define the following parameters:

  • input, to gather logs from external sources such as a log file, TCP socket, and so on. For details, see Input plugin overview.
  • filter, to filter the log entries gathered by the Input plugin. For example, to add, change, or remove fields. For details, see Filter plugin overview.
  • match, to push final log entries to a given destination such as Elasticsearch, file, and so on. For details, see Output plugin overview.
  • label, to connect the inputs. Logs gathered by Fluentd are processed from top-to-bottom of a configuration file. The label parameter connects the inputs, filters, and matches into a single flow. Using the label parameter ensures that filters for a given label are defined after input and before match.

Note

Perform all changes in the Reclass model. Add the custom log parsing rules used by a single environment to the cluster model. Place the log parsing rules that apply to all deployments in the /meta/ directory in the Reclass model for a particular node. For details, see the */meta/fluentd.yml file of the required Salt formula.

To configure logs gathering:

  1. Log in to the Salt Master node.

  2. On the cluster level, specify the following snippets in the Reclass model for a particular node as required:

    • To add a new input:

      fluentd:
        agent:
          config:
            input:
              file_name:
                input_name:
                  parameterA: 10
                  parameterB: C
                input_nameB:
                  parameterC: ABC
      
    • To add a new filter:

      fluentd:
        agent:
          config:
            filter:
              file_name:
                filter_name:
                  parameterA: 10
                  parameterB: C
                filter_nameB:
                  parameterC: ABC
      
    • To add a new match:

      fluentd:
        agent:
          config:
            match:
              file_name:
                match_name:
                  parameterA: 10
                  parameterB: C
                match_nameB:
                  parameterC: ABC
      
    • If the service requires more advanced processing than gathering logs from an external source (input), add a label. For example, if you want to add filtering, use the label parameter that defines the whole flow. All entries in label are optional, so you can define filter and match but skip input.

      fluentd:
        agent:
          config:
            label:
              label_name:
                input:
                  input1:
                    parameter1: abc
                  input2:
                    parameter1: abc
                filter:
                  filter1:
                    parameter1: abc
                    parameter2: abc
                match:
                  match1:
                    parameter1: abc
      

      Example:

      fluentd:
        agent:
          config:
            label:
              docker:
                input:
                  container:
                    type: tail
                    tag: temp.docker.container.*
                    path: /var/lib/docker/containers/*/*-json.log
                    path_key: log_path
                    pos_file: {{ positiondb }}/docker.container.pos
                    parser:
                      type: json
                      time_format: '%Y-%m-%dT%H:%M:%S.%NZ'
                      keep_time_key: false
                filter:
                  enrich:
                    tag: 'temp.docker.container.**'
                    type: record_transformer
                    enable_ruby: true
                    record:
                      - name: severity_label
                        value: INFO
                      - name: Severity
                        value: 6
                      - name: programname
                        value: docker
                match:
                  cast_service_tag:
                    tag: 'temp.docker.container.**'
                    type: rewrite_tag_filter
                    rule:
                      - name: log_path
                        regexp: '^.*\/(.*)-json\.log$'
                        result: docker.container.$1
                  push_to_default:
                    tag: 'docker.container.*'
                    type: relabel
                    label: default_output
      
    • To forward the logs gathered from a custom service to the default output, change the final match statement to default_output.

      Example:

      fluentd:
        agent:
          config:
            label:
              custom_daemon:
                input:
                  
                match:
                  push_to_default:
                    tag: 'some.tag'
                    type: relabel
                    label: default_output
      

      Note

      The default output is defined in the system Reclass model. For details, see Default output. All Fluentd labels defined in /meta/ must use this mechanism to ensure log forwarding to the default output destination.

    • To disable input, filter, match, or label, specify enabled: false for the required Fluentd entity.

      Example:

      fluentd:
        agent:
          config:
            label:
              docker:
                enabled: false
      
  3. Apply the following state:

    salt -C 'node_name' state.sls fluentd
    
Add an additional output for Fluentd

If you have a syslog server and want StackLight LMA to send logs to this server, configure an additional output for Fluentd. In this case, Fluentd will push logs both to your syslog server and to Elasticsearch, which is the default target.

To add an additional output for Fluentd:

  1. Download and install the td-agent-additional-plugins package on every host that runs Fluentd:

    apt-get install --only-upgrade td-agent-additional-plugins
    
  2. Open your Git project repository with the Reclass model on the cluster level.

  3. In the classes/cluster/<cluster_name>/infra/init.yml file, perform the following changes:

    1. Comment out the system.fluentd.label.default_output.elasticsearch class.
    2. Copy the default_output parameters and rename the copy to elasticsearch_output.
    3. To apply the existing filters for all outputs, copy the default output filter section to the new default output.
    4. Add syslog_output and specify the parameters as required.

    Example:

    classes:
    - system.fluentd
    - system.fluentd.label.default_metric
    - system.fluentd.label.default_metric.prometheus
    ## commented out
    #- system.fluentd.label.default_output.elasticsearch
    - service.fluentd.agent.output.syslog
    parameters:
      fluentd:
        agent:
          plugin:
            fluent-plugin-remote_syslog:
              deb: ['td-agent-additional-plugins']
          config:
            label:
              ## renamed previous default_output -> elasticsearch_output
              elasticsearch_output:
                match:
                  elasticsearch_output:
                    tag: "**"
                    type: elasticsearch
                    host: ${_param:fluentd_elasticsearch_host}
                    port: ${_param:elasticsearch_port}
              syslog_output:
                match:
                  syslog_output:
                    tag: "**"
                    type: syslog
                    host: 127.0.0.1
                    port: 514
                    ## optional params:
                    # format: xxx
                    # severity: xxx
                    # facility: xxx
                    # protocol: xxx
                    # tls: xxx
                    # ca_file: xxx
                    # verify_mode: xxx
                    # packet_size: xxx
                    # timeout: xxx
                    # timeout_exception: xxx
                    # keep_alive: xxx
                    # keep_alive_idle: xxx
                    # keep_alive_cnt: xxx
                    # keep_alive_intvl: xxx
              default_output:
               ## copy of previous default_output filter section
                # filter: {}
                match:
                  send_to_default:
                    tag: "**"
                    type: copy
                    store:
                      - type: relabel
                        label: syslog_output
                      - type: relabel
                        label: elasticsearch_output
    
  4. Log in to the Salt Master node.

  5. Synchronize Salt modules and refresh Salt pillars:

    salt '*' saltutil.sync_all
    salt '*' saltutil.refresh_pillar
    
  6. Apply the following state:

    salt '*' state.sls fluentd
    
Enable sending CADF events to external SIEM systems

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

You can configure Fluentd running on the RabbitMQ nodes to forward the Cloud Auditing Data Federation (CADF) events to specific external security information and event management (SIEM) systems, such as Splunk, ArcSight, or QRadar. The procedure below provides a configuration example for Splunk.

To enable sending CADF events to Splunk:

  1. Open your project Git repository with Reclass model on the cluster level.

  2. In classes/cluster/cluster_name/stacklight, create a custom notification channel, for example, fluentd_splunk.yml with the following pillar specifying the hosts and ports in the splunk_output and syslog_output parameters:

    parameters:
      fluentd:
        agent:
          config:
            label:
              audit_messages:
                filter:
                  get_payload_values:
                    tag: audit
                    type: record_transformer
                    enable_ruby: true
                    record:
                      - name: Logger
                        value: ${fluentd:dollar}{ record.dig("publisher_id") }
                      - name: Severity
                        value: ${fluentd:dollar}{ {'TRACE'=>7,'DEBUG'=>7,'INFO'=>6,\
                        'AUDIT'=>6,'WARNING'=>4,'ERROR'=>3,'CRITICAL'=>2}\
                        [record['priority']].to_i }
                      - name: Timestamp
                        value: ${fluentd:dollar}{ DateTime.strptime(record.dig\
                        ("payload", "eventTime"), "%Y-%m-%dT%H:%M:%S.%N%z").strftime\
                        ("%Y-%m-%dT%H:%M:%S.%3NZ") }
                      - name: notification_type
                        value: ${fluentd:dollar}{ record.dig("event_type") }
                      - name: severity_label
                        value: ${fluentd:dollar}{ record.dig("priority") }
                      - name: environment_label
                        value: ${_param:cluster_domain}
                      - name: action
                        value: ${fluentd:dollar}{ record.dig("payload", "action") }
                      - name: event_type
                        value: ${fluentd:dollar}{ record.dig("payload", "eventType") }
                      - name: outcome
                        value: ${fluentd:dollar}{ record.dig("payload", "outcome") }
                  pack_payload_to_json:
                    tag: audit
                    require:
                      - get_payload_values
                    type: record_transformer
                    enable_ruby: true
                    remove_keys: '["payload", "timestamp", "publisher_id", "priority"]'
                    record:
                      - name: Payload
                        value: ${fluentd:dollar}{ record["payload"].to_json }
                match:
                  send_to_default:
                    tag: "**"
                    type: copy
                    store:
                      - type: relabel
                        label: splunk_output
                      - type: relabel
                        label: syslog_output
              splunk_output:
                match:
                  splunk_output:
                    tag: "**"
                    type: splunk_hec
                    host: <splunk_host>
                    port: <splunk_port>
                    token: <splunk_token>
              syslog_output:
                match:
                  syslog_output:
                    tag: "**"
                    type: syslog
                    host: <syslog_host>
                    port: <syslog_port>
    
  3. In openstack/message_queue.yml:

    1. Replace the system.fluentd.notifications class with the following ones:

      classes:
      - system.fluentd.label.notifications.input_rabbitmq
      - system.fluentd.label.notifications.notifications
      
    2. Add the custom Fluentd channel as required. For example:

      cluster.<cluster_name>.stacklight.fluentd_splunk
      
  4. Log in to the Salt Master node.

  5. Apply the fluentd state on the msg nodes:

    salt -C 'I@rabbitmq:server' state.sls fluentd
    
Enable Fluentd to expose metrics generated from logs

You can enable exposing metrics that are based on log events. This allows monitoring of various activities such as disk failures (the hdd_errors_total metric). By default, Fluentd generates metrics from the logs it gathers. However, you must configure Fluentd to expose such metrics to Prometheus. Prometheus gathers Fluentd metrics from a static Prometheus endpoint. For details, see Add a custom monitoring endpoint. To generate metrics from logs, StackLight LMA uses the fluent-plugin-prometheus plugin.

To configure Fluentd to expose metrics generated from logs:

  1. Log in to the Salt Master node.

  2. Add the following class to the cluster/<cluster_name>/init.yml file of the Reclass model:

    system.fluentd.label.default_metric.prometheus
    

    This class creates a new label default_metric that is used as a generic interface to expose new metrics to Prometheus.

  3. (Optional) Create a filter for metric.metric_name to generate the metric.

    Example:

    fluentd:
      agent:
        label:
          default_metric:
            filter:
              metric_out_of_memory:
                tag: metric.out_of_memory
                type: prometheus
                metric:
                  - name: out_of_memory_total
                    type: counter
                    desc: The total number of OOM.
                label:
                  - name: host
                    value: ${Hostname}
              metric_hdd_errors_parse:
                tag: metric.hdd_errors
                type: parser
                key_name: Payload
                parser:
                  type: regexp
                  format: '/(?<device>[sv]d[a-z]+\d*)/'
              metric_hdd_errors:
                tag: metric.hdd_errors
                require:
                  - metric_hdd_errors_parse
                type: prometheus
                metric:
                  - name: hdd_errors_total
                    type: counter
                    desc: The total number of hdd errors.
                label:
                  - name: host
                    value: ${Hostname}
                  - name: device
                    value: ${device}
          systemd:
            output:
              push_to_default:
                tag: '*.systemd'
                type: copy
                store:
                  - type: relabel
                    label: default_output
                  - type: rewrite_tag_filter
                    rule:
                      - name: Payload
                        regexp: '^Out of memory'
                        result: metric.out_of_memory
                      - name: Payload
                        regexp: >-
                          'error.+[sv]d[a-z]+\d*'
                        result: metric.hdd_errors
                      - name: Payload
                        regexp: >-
                          '[sv]d[a-z]+\d*.+error'
                        result: metric.hdd_errors
              push_to_metric:
                tag: 'metric.**'
                type: relabel
                label: default_metric
    
Configure log rotation

Fluentd uses two options to control log file rotation: the logrotate parameter, which controls log rotation on a daily basis, and the internal td_agent_log_rotate_size parameter, which sets the internal log rotation by file size and defaults to 10 MB. If a log file exceeds this limit, the internal log rotation service of Fluentd rotates the file. You can modify td_agent_log_rotate_size if required.

To configure log rotation:

  1. Log in to the Salt Master node.

  2. Specify the following parameter in the cluster/<cluster_name>/init.yml file of the Reclass model:

    parameters:
      fluentd:
        agent:
          td_agent_log_rotate_size: <custom_value_in_bytes>
    
  3. Apply the following state:

    salt -C 'I@fluentd:agent' state.sls fluentd
    
Configure Elasticsearch

The configuration parameters of Elasticsearch are defined in the corresponding sections of the Reclass model.

To configure Elasticsearch:

  1. Log in to the Salt Master node.

  2. Configure the parameters:elasticsearch section in the classes/cluster/<cluster_name>/stacklight/log.yml file of the Reclass model as required. For example, to limit the heap size, specify the following snippet:

    parameters:
      elasticsearch:
        server:
          heap:
            size: 31
    
  3. Apply the Salt state:

    salt -C 'I@elasticsearch:server' state.sls elasticsearch
    

For configuration examples, see the README.rst at Elasticsearch Salt formula.

Configure Elasticsearch Curator

The Elasticsearch Curator tool manages the data (indices) and the data retention policy in Elasticsearch clusters. You can modify the indices and the retention policy.

To configure Elasticsearch Curator:

  1. Open your Reclass model Git repository on the cluster level.

  2. Modify the classes/cluster/<cluster_name>/stacklight/log_curator.yml file as required:

    • To configure indices, set the required prefixes using the elasticsearch_curator_indices_pattern parameter. The default value is "^(log|audit)-.*$", meaning that Curator manages the indices with log- and audit- prefixes.
    • To configure the retention policy for logs and audit indices, specify the elasticsearch_curator_retention_period parameter. The retention period is set to 31 days by default.
    • To configure the retention policy for notification indices, specify the elasticsearch_curator_notifications_retention_period parameter. The retention period is set to 90 days by default.
  3. Log in to the Salt Master node.

  4. Apply the following state:

    salt -C 'I@elasticsearch:server' state.sls_id elasticsearch_curator_action_config elasticsearch
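
For illustration only, the following snippet is a minimal sketch of how the parameters described above may be defined in log_curator.yml, assuming they are set as cluster-level _param overrides. The values shown are the documented defaults (the retention periods are in days); adjust them as required:

parameters:
  _param:
    # Curator manages only the indices that match this pattern
    elasticsearch_curator_indices_pattern: "^(log|audit)-.*$"
    # Retention for log and audit indices, in days
    elasticsearch_curator_retention_period: 31
    # Retention for notification indices, in days
    elasticsearch_curator_notifications_retention_period: 90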
    
Configure Kibana

The configuration parameters of Kibana are defined in the corresponding sections of the Reclass model.

To configure Kibana:

  1. Log in to the Salt Master node.

  2. Configure the parameters:kibana section in the classes/cluster/<cluster_name>/stacklight/server.yml of the Reclass model as required.

  3. Apply the Salt state:

    salt -C 'I@kibana:server' state.sls kibana
    

For configuration examples, see the README.rst at Kibana Salt formula.
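
For illustration only, the following snippet is a minimal sketch of the parameters:kibana section that assumes you only need to enable the server and adjust its bind address and port. The exact set of available options depends on the Kibana Salt formula version, so verify them against the README.rst mentioned above:

parameters:
  kibana:
    server:
      enabled: true
      bind:
        # Assumed defaults; Kibana listens on port 5601
        address: 0.0.0.0
        port: 5601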

Configure Grafana

The configuration of Grafana is stored in the grafana section of the Reclass model.

To configure Grafana:

  1. Log in to the Salt Master node.

  2. Configure the grafana section in the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt formulas:

    salt -C 'I@grafana:server' state.sls grafana.server
    salt -C 'I@grafana:client' state.sls grafana.client
    

Example configuration:

grafana:
  server:
    enabled: true
    bind:
      address: 127.0.0.1
      port: 3000
    database:
      engine: mysql
      host: 127.0.0.1
      port: 3306
      name: grafana
      user: grafana
      password: db_pass
    auth:
      basic:
        enabled: true
    admin:
      user: admin
      password: admin_pass
    dashboards:
      enabled: false
      path: /var/lib/grafana/dashboards

Configure InfluxDB

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The configuration of InfluxDB is stored in the parameters:influxdb section of the Reclass model.

To configure InfluxDB:

  1. Log in to the Salt Master node.

  2. Configure the parameters:influxdb section in the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt state:

    salt -C 'I@influxdb:server' state.sls influxdb
    

For configuration examples, see the README.rst at InfluxDB Salt formula.

Configure authentication for Prometheus and Alertmanager

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can configure basic authentication to access the Prometheus and Alertmanager web UIs through the proxy nodes that are available if external access to cloud resources is enabled in your OpenStack deployment.

This section describes how to configure authentication for Prometheus and Alertmanager by defining new passwords instead of the default ones on the existing MCP deployments updated to 2019.2.7. For new clusters starting from the MCP 2019.2.7 maintenance update, you define custom credentials for Prometheus and Alertmanager during the cluster creation.

To configure authentication for Prometheus and Alertmanager:

  1. Log in to the Salt Master node.

  2. Obtain user names and passwords:

    1. For Alertmanager:

      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_password
      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_user
      
    2. For Prometheus:

      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_user
      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_password
      
  3. Change the default credentials for Prometheus and Alertmanager:

    1. Open the classes/cluster/<cluster_name>/stacklight/proxy.yml file for editing.

    2. Specify new passwords using the following parameters:

      parameters:
        _param:
           nginx_proxy_prometheus_alertmanager_password: <password>
           nginx_proxy_prometheus_server_password: <password>
      
    3. Optional. Specify new user names using the following parameters:

      parameters:
        _param:
           nginx_proxy_prometheus_alertmanager_user: <user_name>
           nginx_proxy_prometheus_server_user: <user_name>
      
  4. On all proxy nodes, synchronize Salt modules and apply the nginx state. For example:

    salt 'prxNode01*' saltutil.sync_all
    salt 'prxNode01*' state.sls nginx
    
  5. Verify authentication through the proxy nodes. For example:

    salt 'prxNode01*' pillar.get _param:cluster_vip_address
    salt 'prxNode01*' pillar.get nginx:server:site:nginx_proxy_prometheus_server:proxy:port
    salt 'prxNode01*' pillar.get nginx:server:site:nginx_proxy_prometheus_alertmanager:proxy:port
    curl https://<cluster_vip_address>:<prometheus_server_port>
    curl -u <username>:<password> https://<cluster_vip_address>:<prometheus_server_port>
    

Enable Docker garbage collection

To avoid unused Docker images and volumes consuming the entire disk space, you can enable a clean-up cron job for old StackLight LMA containers and volumes. By default, the cron job runs daily at 6:00 a.m. and removes stopped StackLight LMA containers and unused images that are older than one week.

To enable Docker garbage collection:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/stacklight/server.yml, specify the following parameter:

    _param:
      docker_garbage_collection_enabled: true
    
  3. Optional. To change the default parameters, use:

    linux:
      system:
        cron:
          user:
            root:
              enabled: true
        job:
          docker_garbage_collection:
            command: docker system prune -f --filter until=$(date +%s -d "1 week ago")
            enabled: ${_param:docker_garbage_collection_enabled}
            user: root
            hour: 6
            minute: 0
    
  4. Log in to the Salt Master node.

  5. Apply the following state:

    salt -C 'I@docker:swarm' state.sls linux.system.cron
    

Disable HTTP probes for OpenStack public endpoints

Note

This feature is available starting from the MCP 2019.2.12 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

By default, Telegraf checks all endpoints from the OpenStack service catalog, including the public, admin, and internal endpoints. In some cases, public endpoints may be unreachable from the Telegraf container. For such cases, you can configure StackLight to skip HTTP probes for OpenStack public endpoints.

To disable or enable HTTP probes for OpenStack public endpoints:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In <cluster_model>/stacklight/server.yml, specify the skip_public_endpoints parameter as required:

    parameters:
      telegraf:
        remote_agent:
          input:
            openstack_api:
              skip_public_endpoints: true
    
  3. Log in to the Salt Master node.

  4. Apply the telegraf state to the mon nodes:

    salt 'mon*' state.sls telegraf
    

Restart StackLight LMA components

You may need to restart one of the StackLight LMA components, for example, if its service hangs.

Restart services running in Docker Swarm

The Prometheus, Alertmanager, Alerta, Pushgateway, and Grafana services run in Docker Swarm mode. This section describes how to restart these services.

To restart services running in Docker Swarm:

  1. Log in to the Salt Master node.

  2. Issue one of the following commands depending on the service you want to restart:

    • To restart Prometheus:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_server --force"
      
    • To restart Alertmanager:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_alertmanager --force"
      
    • To restart Alerta:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_alerta --force"
      
    • To restart Pushgateway:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_pushgateway --force"
      
    • To restart Grafana:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update dashboard_grafana --force"
      
    • To restart Prometheus Relay:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_relay --force"
      
    • To restart Prometheus Remote Agent:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_remote_agent --force"
      

Restart the logging system components

The logging system components include Fluentd, Elasticsearch, and Kibana. If required, you can restart these components.

To restart Fluentd

The Fluentd process that is responsible for collecting logs is td-agent (Treasure Data Agent). The td-agent process starts automatically when the fluentd Salt state is applied. To manually start and stop it, use the Salt commands.

The following example shows how to restart the td-agent process from the Salt Master node on all Salt Minion nodes with names that start with ctl:

# salt 'ctl*' service.restart td-agent

Alternatively, SSH to the node and use the service manager (systemd or upstart) to restart the service. For example:

# ssh ctl01.mcp-lab-advanced.local
# service td-agent restart

See the salt.modules.service documentation for more information on how to use the Salt service execution module.

To restart Elasticsearch

Run the following command from the Salt Master node:

# salt 'log*' service.restart elasticsearch

To restart Kibana

Run the following command from the Salt Master node:

# salt 'log*' service.restart kibana

Restart Telegraf

The Telegraf service is called telegraf.

To restart Telegraf on all nodes:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@telegraf:agent' service.restart telegraf
    

Restart InfluxDB

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The InfluxDB service is called influxdb.

To restart InfluxDB on all nodes:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@influxdb:server' service.restart influxdb -b 1
    

Restart InfluxDB Relay

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The InfluxDB Relay service is called influxdb-relay.

To restart InfluxDB Relay:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@influxdb:server' service.restart influxdb-relay -b 1
    

Restart Prometheus Relay and Prometheus long-term storage

You can restart Prometheus Relay and Prometheus long-term storage, for example, if one of the services hangs. Since these services are connected, you must restart both.

To restart Prometheus Relay and Prometheus long-term storage:

  1. Log in to the Salt Master node.

  2. Run the following commands:

    salt -C 'I@prometheus:relay' service.restart prometheus
    salt -C 'I@prometheus:relay' service.restart prometheus-relay
    

Manage endpoints, metrics, and alerts

You can easily configure StackLight LMA to support new monitoring endpoints, add custom metrics and alerts, and modify or disable the existing alerts.

Add a custom monitoring endpoint

If required, you can add a custom monitoring endpoint to Prometheus, such as Calico, etcd, or Telegraf.

To add a custom monitoring endpoint:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required. Add the monitoring endpoint IP and port.

    Example:

    prometheus:
      server:
        target:
          static:
            endpoint_name:
              endpoint:
                - address: 1.1.1.1
                  port: 10
                - address: 2.2.2.2
                  port: 10
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    

Add a custom metric

If required, you can add a custom metric, for example, to monitor third-party software. This section describes how to add a custom metric to Telegraf.

Note

To add a custom metric to a new endpoint, you must first add the new endpoint to Prometheus as described in Add a custom monitoring endpoint.

To add a custom metric to Telegraf:

  1. Log in to the Salt Master node.

  2. Edit the telegraf section in the classes/cluster/cluster_name/init.yml file of the Reclass model as required.

    Example:

    telegraf:
      agent:
        input:
          procstat:
            process:
              memcached:
                exe: memcached
          memcached:
            servers:
              - address: {{ server.bind.address | replace("0.0.0.0", "127.0.0.1") }}
                port: {{ server.bind.port }}
    
  3. Apply the Telegraf Salt formula:

    salt -C 'I@linux:system' state.sls telegraf
    

Manage alerts

You can easily extend StackLight LMA to support a new service check by adding a custom alert. You may also need to modify or disable the default alerts as required.

To create a custom alert:

  1. Log in to the Salt Master node.

  2. Add the new alert to the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model. Enter the alert name, alerting conditions, severity level, and annotations that will be shown in the alert message.

    Example:

    prometheus:
      server:
        alert:
          EtcdFailedTotalIn5m:
            if: >-
              sum by(method) (rate(etcd_http_failed_total{code!~"4[0-9]{2}"}[5m]))
              / sum by(method) (rate(etcd_http_received_total[5m])) > {{
              prometheus_server.get('alert', {}).get('EtcdFailedTotalIn5m', \
              {}).get('var', {}).get('threshold', 0.01) }}
            labels:
              severity: warning
              service: etcd
            annotations:
              summary: 'High number of HTTP requests are failing on etcd'
              description: '{{ $value }}% of requests for {{ $labels.method }} \
              failed on etcd instance {{ $labels.instance }}'
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. To view the new alert, see the Prometheus logs:

    docker service logs monitoring_server
    

    Alternatively, see the Alerts tab of the Prometheus web UI.

To modify a default alert:

  1. Log in to the Salt Master node.

  2. Modify the required alert in the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model.

  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. To view the changes, see the Prometheus logs:

    docker service logs monitoring_server
    

    Alternatively, see the alert details in the Alerts tab of the Prometheus web UI.

To disable an alert:

  1. Log in to the Salt Master node.

  2. Create the required alert definition in the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model and set the enabled parameter to false.

    Example:

    prometheus:
      server:
        alert:
          EtcdClusterSmall:
            enabled: false
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. Verify the changes in the Alerts tab of the Prometheus web UI.

Configure StackLight LMA to send notifications

To enable StackLight LMA to send notifications, you must modify the Reclass model. By default, email notifications will be sent. However, you can configure StackLight LMA to send notifications to Salesforce, Slack, or other receivers. Additionally, you can specify a notification receiver for a particular alert or alert types or disable notifications.

Enable Alertmanager notifications

To enable StackLight LMA to send notifications, you must modify the Reclass model. By default, email notifications will be sent. However, you can configure StackLight LMA to send notifications to Salesforce, Slack, or other notifications receivers. You can also specify a notification channel for a particular alert or disable notifications.

Enable email or Slack notifications

This section describes how to enable StackLight LMA to send notifications to email, Slack, or to both notification channels using the Alertmanager service on an existing MCP cluster. By default, StackLight LMA uses Alertmanager and the SMTP protocol or the webhook receiver to send email or Slack notifications respectively.

Note

Skip this section if you require only email notifications and have already defined the variables for Alertmanager email notifications during the deployment model creation as described in MCP Deployment Guide: Infrastructure related parameters: Alertmanager email notifications.

To enable StackLight LMA to send notifications through Alertmanager:

  1. Log in to the Salt Master node.

  2. Open the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model for editing.

  3. Add the following classes:

    • For email notifications:

      classes:
      [...]
      - system.prometheus.alertmanager.notification.email
      - system.prometheus.server.alert.labels_add.route
      
    • For Slack notifications:

      classes:
      [...]
      - system.prometheus.alertmanager.notification.slack
      - system.prometheus.server.alert.labels_add.route
      
  4. Define the following variables:

    • For email notifications:

      parameters:
        _param:
          alertmanager_notification_email_from: <email_from>
          alertmanager_notification_email_host: <smtp_server:port>
          alertmanager_notification_email_password: <email_password>
          alertmanager_notification_email_require_tls: <email_require_tls>
          alertmanager_notification_email_to: <email_to>
          alertmanager_notification_email_username: <email_username>
      

      Note

      Using the alertmanager_notification_email_host parameter, specify both the host and the port number of the SMTP server. For example, host.com:25.

    • For Slack notifications:

      parameters:
        _param:
          alertmanager_notification_slack_api_url: https://hooks.slack.com/services/<webhook/integration/token>
      
  5. Set one or multiple notification channels by using the _param:prometheus_server_alert_label_route parameter. The default value is email, which means that email notifications will be sent.

    Example:

    parameters:
      _param:
        prometheus_server_alert_label_route: email;slack;
    
  6. Apply the Salt formulas:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    
Enable Salesforce notifications

This section describes how to enable StackLight LMA to create Salesforce cases from Prometheus alerts on an existing cluster. StackLight LMA uses Alertmanager and the Salesforce notifier service to create the Salesforce cases.

Note

Skip this section if you have already defined the variables for Alertmanager Salesforce notifications during the deployment model creation as described in MCP Deployment Guide: General deployment parameters and MCP Deployment Guide: Infrastructure related parameters: Alertmanager Salesforce notifications.

If you configured Salesforce notifications through the Push Notification service, first proceed to Switch to Alertmanager-based notifications.

To enable StackLight LMA to send Salesforce notifications through Alertmanager:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/stacklight/client.yml, specify:

    classes:
    - system.docker.swarm.stack.monitoring.sf_notifier
    
    [...]
    
    parameters:
      _param:
        docker_image_sf_notifier: "${_param:mcp_docker_registry}/openstack-docker/sf_notifier:${_param:mcp_version}"
    
  3. In classes/cluster/<cluster_name>/stacklight/server.yml, specify:

    classes:
    - system.prometheus.alertmanager.notification.salesforce
    - system.prometheus.sf_notifier.container
    
    [...]
    
    parameters:
      _param:
        sf_notifier_sfdc_auth_url: "<salesforce_instance_http_endpoint>"
        sf_notifier_sfdc_username: "<customer_account_email>"
        sf_notifier_sfdc_password: "<customer_account_password>"
        sf_notifier_sfdc_organization_id: "<organization_id>"
        sf_notifier_sfdc_environment_id: "<cloud_id>"
        sf_notifier_sfdc_sandbox_enabled: "True/False"
    

    Warning

    If you have previously configured email notifications through Alertmanager, verify that the prometheus_server_alert_label_route parameter in server.yml includes not only the email but also the salesforce value, for example, email;salesforce;.

  4. Log in to the Salt Master node.

  5. Refresh Salt pillars:

    salt '*' saltutil.refresh_pillar
    
  6. Create the directory structure for the Salesforce notifier service:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.sf_notifier
    
  7. Start the sf-notifier service in Docker container:

    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    
  8. Update the Prometheus configuration to create metrics target and alerts:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b 1
    
  9. Update the Alertmanager configuration to create the webhook receiver:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    
Configure Alertmanager integrations

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable StackLight LMA to send notifications to specific receivers such as PagerDuty, OpsGenie, and so on using the Alertmanager service on an existing MCP cluster. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

The changes performed within the following procedure are backward compatible with previous MCP releases. Therefore, you do not need to change any of the already configured Alertmanager receivers and routes.

To enable Alertmanager integrations:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Open the classes/cluster/<cluster_name>/stacklight/server.yml file for editing.

  3. Configure the Alertmanager receiver route as required:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                <route_name>:
                  receiver: <receiver_name>
                  match_re:
                    - label: '<name_of_the_alert_label>'
                      value: '<regex_to_identify_the_route>'
                  continue: true
            receiver:
              <receiver_name>:
                enabled: true
                generic_configs:  # <- here is the difference
                  <chosen_Alertmanager_receiver_type>:
                    <receiver_endpoint_name>:
                      <receiver_config>
    
    Parameters description
    Parameter name Description
    routes Specify a unique route name.
    receiver Specify a unique receiver name.
    match_re
    • In label, specify the name of the label to filter the alerts.
    • In value, specify a regular expression that Alertmanager will use to route the alert.
    receiver Specify a unique receiver name. Set it to the same value as the receiver parameter defined in routes.
    generic_configs
    • Specify one of the available Alertmanager receiver types.
    • Specify a unique receiver endpoint name.
    • Specify the endpoint configuration in the YAML format.

    Example configuration:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                opsgenie:
                  receiver: HTTP-opsgenie
                  match_re:
                    - label: route
                      value: '(.*opsgenie.*)'
                  continue: true
            receiver:
              HTTP-opsgenie:
                enabled: true
                generic_configs:
                  opsgenie_configs:
                    opsgenie-endpoint:
                      api_url: "https://example.app.eu.opsgenie.com/"
                      api_key: "opsgeniesecretkey"
                      send_resolved: true
    
  4. If you have previously configured email notifications through Alertmanager, verify that the prometheus_server_alert_label_route parameter also includes the receiver that you are configuring. For example:

    parameters:
      _param:
        prometheus_server_alert_label_route: email;opsgenie;
    
  5. Log in to the Salt Master node.

  6. Verify the Reclass model:

    reclass -i
    

    The output should not include any errors.

  7. Verify that the changes render properly:

    salt '*mon*' state.sls prometheus.alertmanager test=True
    

    Example of system response for the configuration provided above:

    routes:
      - receiver: default
    [...]
      # opsgenie
      - receiver: HTTP-opsgenie
        continue: True
        match_re:
          route: '(.*opsgenie.*)'
    [...]
    receivers:
      - name: 'default'
    [...]
      - name: 'HTTP-opsgenie'
        opsgenie_configs:
        # opsgenie-endpoint
        -
          api_key: opsgeniesecretkey
          api_url: https://example.app.eu.opsgenie.com/
          send_resolved: true
    [...]
    
  8. Update the Prometheus and Alertmanager configuration:

    salt '*mon*' state.sls prometheus.server -b 1
    salt '*mon*' state.sls prometheus.alertmanager
    
Switch to Alertmanager-based notifications

This section describes how to switch from the Push Notification service to the Alertmanager-based notifications. For more information on the Alertmanager-based notifications, see MCP Reference Architecture: StackLight LMA components.

Caution

Before you start, perform the following prerequisite steps:

  1. Upgrade StackLight LMA to the newest version as described in Upgrade StackLight LMA to Build ID 2019.2.0.
  2. For the Salesforce notifications, verify that you have access to the Salesforce instance and have a customer account in Salesforce.

Warning

The Push Notification service uses the md5 hashing algorithm for creating alert IDs for the Salesforce notifications, whereas the Salesforce notifier service uses sha256 by default. Switching to sha256 without migration of Salesforce cases may lead to cases loss from the Salesforce notifier scope. If these services have different hashing set up, case duplication with the same subject and status but different IDs may occur. Therefore, Mirantis recommends explicitly setting the md5 hashing algorithm in your model configuration as described below. Migration of old cases is not supported.

To switch to the Alertmanager-based notifications:

  1. Open your Git project repository with Reclass model on the cluster level.

  2. Set up the Alertmanager-based notifications:

    • For email notifications, follow the procedure described in Enable email or Slack notifications.

    • For Salesforce notifications:

      1. Set the md5 hashing algorithm in the <cluster>/stacklight/server.yml file:

        _param:
          sf_notifier_alert_id_hash_func: md5
        
      2. Set up the Salesforce notifier service as described in Enable Salesforce notifications.

  3. Disable the Push Notification service:

    1. Open the classes/cluster/<cluster_name>/stacklight/server.yml file for editing.

    2. Remove the following class:

      - system.prometheus.alertmanager.notification.pushkin
      
    3. Remove the following parameters:

      alertmanager_notification_pushkin_host: <host>
      alertmanager_notification_pushkin_port: <port>
      
  4. Log in to the Salt Master node.

  5. Refresh Salt pillars:

    salt '*' saltutil.refresh_pillar
    
  6. Apply the following state:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    
  7. (Optional) Remove the Push Notification service:

    1. Verify that only the Push Notification service uses the database in the Docker container.

    2. In the Git project repository with the Reclass model on the cluster level, open the classes/cluster/<cluster_name>/stacklight/server.yml file for editing.

    3. Remove the following classes:

      - system.haproxy.proxy.listen.oss.pushkin
      - system.haproxy.proxy.listen.oss.postgresql
      - system.docker.swarm.stack.pushkin
      - system.docker.swarm.stack.postgresql
      - system.docker.swarm.network.oss_backend
      - system.postgresql.client.pushkin
      - system.postgresql.client.pushkin.sfdc
      - system.glusterfs.server.volume.postgresql
      - system.glusterfs.server.volume.pushkin
      - system.glusterfs.client.volume.postgresql
      - system.glusterfs.client.volume.pushkin
      
    4. Remove the following parameters:

      postgresql_client_user: 'postgres'
      postgresql_client_password: 'postgrespassword'
      
    5. Log in to the Salt Master node.

    6. Apply the changes:

      salt -C 'I@haproxy:proxy' state.sls haproxy.proxy
      salt -C 'I@docker:swarm and I@prometheus:server' state.sls docker.client
      
    7. Remove the Push Notification service from Docker Swarm:

      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'docker stack rm pushkin'
      
    8. Remove the PostgreSQL service from Docker Swarm:

      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'docker stack rm postgresql'
      
    9. Remove the Docker images for the Push Notification service and PostgreSQL:

      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'docker rmi <postgres_image> <pushkin_image>'
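
      If you need to identify the exact image names first, you can list the images on the affected nodes, for example:

      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'docker images'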
      
    10. Stop the GlusterFS volumes of the Push Notification service and PostgreSQL:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'echo "y" | gluster volume stop pushkin'
      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'echo "y" | gluster volume stop postgresql'
      
    11. Delete the volumes:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'echo "y" | gluster volume delete pushkin'
      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'echo "y" | gluster volume delete postgresql'
      
    12. Remove the directories for GlusterFS bricks:

      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'rm -rf /srv/glusterfs/pushkin'
      salt -C 'I@docker:swarm and I@prometheus:server' cmd.run 'rm -rf /srv/glusterfs/postgres'
      
    13. (Optional) Uninstall GlusterFS:

      sudo apt remove -y glusterfs-server
      

Customize alerts for notifications

Once configured, StackLight LMA sends notifications for all alerts. However, you can disable notifications for a particular alert or change its default notification route. For example, you can configure StackLight LMA to send email notifications for one alert and Salesforce notifications for another.

To customize alerts for notifications:

  1. Log in to the Salt Master node.

  2. Open the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model for editing.

  3. Edit the required alert:

    1. Type the alert name.

    2. Set the route parameter to salesforce or email. Alternatively, leave the route parameter empty to disable notifications for the specified alert.

      Example:

      prometheus:
        server:
          alert:
            AlertName:
              labels:
                route: salesforce
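
      Similarly, to send notifications for a different alert to email instead (OtherAlertName is a placeholder for the alert name):

      prometheus:
        server:
          alert:
            OtherAlertName:
              labels:
                route: email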
      
  4. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    

Enable notifications filtering

You can configure StackLight LMA to filter notifications and send particular ones to the specified notification channels. Starting from the maintenance update 2019.2.9, you can also configure notifications subroutes.

Enable notifications filtering

You can configure StackLight LMA to filter notifications and send particular ones to the specified notification channels. For example, you can configure StackLight LMA to send the CRITICAL notifications to email and the WARNING notifications to Slack.

To enable notifications filtering:

  1. Log in to the Salt Master node.

  2. Open the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model for editing.

  3. Specify the label and value parameters in the match_re section for a particular receiver.

    Examples:

    To send the CRITICAL notifications to email:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                email:
                  receiver: SMTP
                  match_re:
                    - label: route
                      value: '(.*email.*)'
                    - label: severity
                      value: critical
                  continue: true
    

    To send the CRITICAL and WARNING notifications to Slack:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                slack:
                  receiver: HTTP-slack
                  match_re:
                    - label: route
                      value: '(.*slack.*)'
                    - label: severity
                      value: critical|warning
                  continue: true
    
  4. Apply the changes:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
Configure notifications subroutes

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can configure subroutes for the required notifications. For example, you can configure StackLight to send a set of notifications to email and, from that set, additionally send the ones with the CRITICAL severity to PagerDuty.

The changes performed within the following procedure are backward compatible with previous MCP releases and will not affect any of the already configured Alertmanager receivers and routes.

To configure notifications subroutes:

  1. Open the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model for editing.

  2. Configure the subroute as required:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                <route name>:
                  receiver: <receiver name>
                  [...]
                  routes:  # indicates the subroutes to configure
                    <subroute_name>:
                      receiver: '<subroute receiver name>'
                      [...]
                [...]
    

    For example, to send the Apache service alerts to email but filter the ones with CRITICAL severity and send them to PagerDuty:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                email_alerts:
                  receiver: team-X-email
                  match_re:
                    - label: service
                      value: apache
                  routes:
                    pager_alerts:
                      match:
                        severity: critical
                      receiver: team-X-pager
                  continue: true
    
  3. Log in to the Salt Master node.

  4. Verify the Reclass model:

    reclass -i
    

    The output should not include any errors.

  5. Verify that the changes render properly:

    salt '*mon*' state.sls prometheus.alertmanager test=True
    

    Example of system response for the configuration provided above:

    [...]
    route:
      receiver: default
    [...]
      routes:
      # email_alerts
      - receiver: team-X-email
        continue: True
        match:
          service: apache
        routes:
        # pager_alerts
        - receiver: team-X-pager
          match:
            severity: critical
    [...]
    
  6. Apply the changes:

    salt '*mon*' state.sls prometheus.alertmanager
    

Configure multiple emails for Alertmanager notifications

By default, you can set only one email for notifications through Alertmanager during the deployment model creation. However, you can configure Alertmanager to send notifications to multiple emails as required.

To configure multiple emails for Alertmanager notifications:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In the classes/cluster/cluster_name/stacklight/server.yml file, specify the emails as required, for example, by splitting the alerts by severity as shown below.

    Example:

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                email-common:
                  receiver: SMTP-common
                  continue: true
                email-critical:
                  receiver: SMTP-critical
                  match_re:
                    - label: severity
                      value: critical
                  continue: true
            receiver:
              SMTP-common:
                enabled: true
                email_configs:
                  smtp_server:
                    to: common@email.com
                    from: ${_param:alertmanager_notification_email_from}
                    auth_username: ${_param:alertmanager_notification_email_username}
                    auth_password: ${_param:alertmanager_notification_email_password}
                    smarthost: ${_param:alertmanager_notification_email_host}
                    require_tls: ${_param:alertmanager_notification_email_require_tls}
                    send_resolved: true
              SMTP-critical:
                enabled: true
                email_configs:
                  smtp_server:
                    to: critical@email.com
                    from: ${_param:alertmanager_notification_email_from}
                    auth_username: ${_param:alertmanager_notification_email_username}
                    auth_password: ${_param:alertmanager_notification_email_password}
                    smarthost: ${_param:alertmanager_notification_email_host}
                    require_tls: ${_param:alertmanager_notification_email_require_tls}
                    send_resolved: true
    
  3. Log in to the Salt Master node.

  4. Apply the following state:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    

Enable notifications through the Push Notification service

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

This section describes how to enable StackLight LMA to send notifications to email, to Salesforce, or to both notification channels using the Push Notification service of the DevOps Portal.

To enable StackLight LMA to send notifications through the Push Notification Service:

  1. Log in to the Salt Master node.

  2. Open the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model for editing.

  3. Add the following classes:

    classes:
     [...]
    - system.prometheus.alertmanager.notification.pushkin
    - system.prometheus.server.alert.labels_add.route
    
  4. Define the following variables:

    parameters:
      _param:
        alertmanager_notification_pushkin_host: ${_param:haproxy_pushkin_bind_host}
        alertmanager_notification_pushkin_port: ${_param:haproxy_pushkin_bind_port}

    Alternatively, specify the host and port values directly. For example:

    parameters:
      _param:
        alertmanager_notification_pushkin_host: 172.16.10.101
        alertmanager_notification_pushkin_port: 16666
    
  5. Set one or multiple notification channels by using the _param:prometheus_server_alert_label_route parameter. The default value is email, which means that email notifications will be sent.

    Example:

    parameters:
      _param:
        prometheus_server_alert_label_route: email;salesforce;
    
  6. Configure email or Salesforce integration:

    • For deployments with the DevOps Portal, follow the procedure described in MCP Deployment Guide: Configure Salesforce integration for OSS manually or MCP Deployment Guide: Configure email integration for OSS manually.

    • For deployments without the DevOps Portal, modify the classes/cluster/cluster_name/stacklight/server.yml file to configure the Push Notification service, including:

      • Docker stack for the Push Notification service
      • PostgreSQL database for the Push Notification service
      • GlusterFS volumes to synchronize the data between monitoring nodes

      Example:

      classes:
        [...]
      #Glusterfs configuration for Push Notification Service
      - system.linux.system.repo.glusterfs
      - system.glusterfs.client.cluster
      - system.glusterfs.server.cluster
      - system.glusterfs.server.volume.postgresql
      - system.glusterfs.server.volume.pushkin
      - system.glusterfs.client.volume.postgresql
      - system.glusterfs.client.volume.pushkin
      
      #Docker stack and network configurations
      - system.docker.swarm.stack.pushkin
      - system.docker.swarm.stack.postgresql
      - system.docker.swarm.network.oss_backend
      
      # Haproxy for Push Notification service
      - system.haproxy.proxy.listen.oss.pushkin
      - system.haproxy.proxy.listen.oss.postgresql
      
      # Postgresql configuration for Push Notification Service
      - system.postgresql.client.pushkin
      - system.postgresql.client.sfdc
      
      parameters:
        _param:
          [...]
      #Glusterfs configuration for Push Notification Service
          glusterfs_service_host: ${_param:stacklight_monitor_address}
          glusterfs_node01_address: ${_param:stacklight_monitor_node01_address}
          glusterfs_node02_address: ${_param:stacklight_monitor_node02_address}
          glusterfs_node03_address: ${_param:stacklight_monitor_node03_address}
      
      # Postgresql configuration for Push Notification Service
          postgresql_client_user: 'postgres'
          postgresql_client_password: 'postgrespassword'
      
      # Email configuration for Push Notification Service
          pushkin_smtp_host: smtp.gmail.com
          pushkin_smtp_port: 587
          webhook_from: your_sender@mail.com
          pushkin_email_sender_password: your_sender_password
          webhook_recipients: "recipient1@mail.com,recipient2@mail.com"
          webhook_login_id: 14
          webhook_application_id: 4
      
      # Salesforce configuration for Push Notification Service
          sfdc_auth_url: ''
          sfdc_username: ''
          sfdc_password: ''
          sfdc_consumer_key: ''
          sfdc_consumer_secret: ''
          environment: ''
          environment_id: ''
          sfdc_environment_id: ''
          sfdc_organization_id: ''
          sfdc_sandbox_enabled: False
      
      # Alertmanager configuration for Push Notification Service
          alertmanager_notification_pushkin_host: ${_param:stacklight_monitor_address}
          alertmanager_notification_pushkin_port: 8887
      

      Note

      For Salesforce parameters definition, see MCP Deployment Guide: OSS parameters.

  7. Apply the Salt formulas:

    salt -C 'I@glusterfs:server' state.sls glusterfs.server.service
    salt -C 'I@glusterfs:server' state.sls glusterfs.server.setup
    salt -C 'I@glusterfs:client' state.sls glusterfs.client
    salt -C 'I@docker:swarm:role:master and I@prometheus:server' state.sls docker.client
    salt -C 'I@postgresql:client' state.sls postgresql.client
    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    

Use the Prometheus web UI

The Prometheus web UI enables you to view simple graphs, Prometheus configuration and rules, and the state of the monitoring endpoints. This section describes how to use the Prometheus web UI.

Connect to the Prometheus web UI

This section describes how to access the Prometheus web UI.

To connect to the Prometheus web UI:

  1. Log in to the Salt Master node.

  2. Obtain the Prometheus hostname (IP address), port, and protocol to use as a URL:

    salt -C 'I@horizon:server' pillar.item nginx:server:site:nginx_proxy_prometheus_server:host
    

    Example of system response:

    prx01.stacklight-dev-2019-2-9.local:
        ----------
        nginx:server:site:nginx_proxy_prometheus_server:host:
            ----------
            name:
                10.13.0.80
            port:
                15010
            protocol:
                https
    
  3. Enter the URL obtained in step 2 in a web browser. For the example output above, the resulting URL is https://10.13.0.80:15010.

Note

Starting from MCP 2019.2.7, to access the Prometheus web UI from an external network, obtain the credentials by running the following commands from the Salt Master node:

salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_user
salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_password

For more details, see Configure authentication for Prometheus and Alertmanager.

View graphs and alerts

Using the Prometheus web UI, you can view simple graphs for such metrics as cluster CPU allocation or memory used. Additionally, you can view alerts.

To view graphs:

  1. Connect to the Prometheus web UI as described in Connect to the Prometheus web UI.
  2. Select the required metric from the drop-down list. Alternatively, enter the metric in the Expression field.
  3. Click Execute.
  4. Navigate to the Graph tab to view the selected graph. If required, specify the time range.
  5. Navigate to the Console tab to view the elements and their values.

Note

To view multiple graphs, click Add Graph and follow steps 2-4.

Note

The graphs disappear once you reload the page.
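
For example, to graph the number of nodes on which the Apache service currently responds to Telegraf, enter the following expression in the Expression field. The apache_up metric is the same metric that the Apache alerts described later in this guide rely on, so it is present only if Apache monitoring is enabled in your deployment:

  count(apache_up == 1)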

To view alerts:

  1. Connect to the Prometheus web UI as described in Connect to the Prometheus web UI.
  2. For a list of alerts, navigate to Alerts. Red alerts are the ones currently firing.
  3. Click on a red alert to view the details on the metric that raised the alert.

Note

The Active Since column displays the time since an alert has fired but does not display the time prior to the start of the Prometheus service itself. Therefore, if you restart the Prometheus service while having some alerts in the firing state, the Active Since column will display the Prometheus service restart time for these alerts instead of the original value.

View Prometheus settings

Using the Prometheus web UI, you can view the Prometheus settings, rules, monitoring endpoints and their state, and so on.

To view the settings:

  1. Connect to the Prometheus web UI as described in Connect to the Prometheus web UI.
  2. Select the required item from the Status drop-down list.

Use the Alertmanager web UI

The Alertmanager web UI enables you to view the most recent fired alerts and silence them, as well as view the Alertmanager configuration.

The Alertmanager web UI provides the following functionality:

  • The Silences tab displays the existing silences and allows you to define new ones for particular alerts.
  • The Alerts tab displays the alerts in the fired state.
  • The Status tab displays the Alertmanager status and settings.

To connect to the Alertmanager web UI:

  1. Log in to the Salt Master node.

  2. Obtain the Alertmanager hostname (IP address), port, and protocol to use as a URL:

    salt -C 'I@horizon:server' pillar.item nginx:server:site:nginx_proxy_prometheus_alertmanager:host
    

    Example of system response:

    prx01.stacklight-dev-2019-2-9.local:
        ----------
        nginx:server:site:nginx_proxy_prometheus_alertmanager:host:
            ----------
            name:
                10.13.0.80
            port:
                15011
            protocol:
                https
    
  3. Enter the URL obtained in step 2 in a web browser.

Note

Starting from MCP 2019.2.7, to access the Alertmanager web UI from an external network, obtain the credentials by running the following commands from the Salt Master node:

salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_password
salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_user

For more details, see Configure authentication for Prometheus and Alertmanager.

Use the Alerta web UI

The Alerta web UI enables you to view the most recent or watched alerts, as well as group and filter alerts according to your needs.

To connect to the Alerta web UI:

  1. Log in to the Salt Master node.

  2. Obtain the Alerta hostname (IP address), port, and protocol to use as a URL:

    salt -C 'I@horizon:server' pillar.item nginx:server:site:nginx_proxy_alerta:host
    

    Example of system response:

    prx01.stacklight-dev-2019-2-9.local:
        ----------
        nginx:server:site:nginx_proxy_alerta:host:
           ----------
           name:
               10.13.0.80
            port:
               15017
            protocol:
               https
    
  3. Obtain the Alerta administrator password:

    • Starting from the MCP 2019.2.11 maintenance update:

      salt -C 'I@prometheus:alerta and I@docker:swarm:role:master' pillar.item _param:alerta_admin_password
      

      Example of system response:

      mon01.queens-ovs-stacklight-2k19-2-om.local:
          ----------
          _param:alerta_admin_password:
              zbfNxydJnTwzgA61CRmrTiZ9w4Gmt5Qf
      
    • Prior to the MCP 2019.2.11 maintenance update:

      salt -C 'I@prometheus:alerta and I@docker:swarm:role:master' pillar.item docker:client:stack:monitoring:service:alerta:environment:ADMIN_PASSWORD
      

      Example of system response:

      mon01.stacklight-dev-2019-2-9.local:
          ----------
          docker:client:stack:monitoring:service:alerta:environment:ADMIN_PASSWORD:
              miUsex85ldx9YZLw3JBRegcYialYCFMA
      
  4. Enter the URL obtained in step 2 in a web browser.

  5. Log in to the Alerta web UI using the default admin@alerta.io user name and the password obtained in step 3.

Use Grafana

Grafana is a web service that builds and visually represents metric graphs based on time series databases. A collection of predefined Grafana dashboards contains graphs on particular endpoints.

In Prometheus-based StackLight LMA, Grafana has the following special aspects:

  • Grafana dashboards do not include annotations based on alerts. To view the alerts, use the Prometheus or Alertmanager web UI. Use Grafana only to visualize data for a selected period rather than for cluster monitoring.
  • Grafana dashboards display only the existing information about nodes, disks, interfaces, and so on, according to a particular time frame, which is set to one hour by default. For example, if a node is offline for 30 minutes and the time frame is set to five minutes, the dashboards will not display this node in the drop-down menus. In this case, see the list of alerts in the Prometheus or Alertmanager web UI.

Warning

Most OpenStack dashboards include the API Availability panel that displays only the OK or DOWN states and does not display the warning states. For warning states, use the Prometheus or Alertmanager web UI.

This section describes how to connect to Grafana, view the available dashboards, and so on.

Connect to Grafana

To access Grafana, use the public IP exposed by the proxy node, the default 8084 port, and the HTTPS protocol.

To connect to Grafana:

  1. Log in to the Salt Master node.

  2. Obtain the Grafana hostname (IP address), port, and protocol to use as a URL:

    salt -C 'I@horizon:server' pillar.item nginx:server:site:nginx_proxy_grafana:host
    

    Example of system response:

    prx01.stacklight-dev-2019-2-9.local:
        ----------
        nginx:server:site:nginx_proxy_grafana:host:
            ----------
            name:
                10.13.0.80
            port:
                8084
            protocol:
                https
    
  3. Obtain the Grafana administrator password:

    • Starting from the MCP 2019.2.11 maintenance update:

      salt -C 'I@grafana:client and I@docker:swarm:role:master' pillar.item _param:grafana_password
      

      Example of system response:

      mon01.queens-ovs-stacklight-2k19-2-om.local:
          ----------
          _param:grafana_password:
              jrRk4Fa6pxrjq7hbVeqm2drz6Ezgpr4m
      
    • Prior to the MCP 2019.2.11 maintenance update:

      salt -C 'I@grafana:client and I@docker:swarm:role:master' pillar.item docker:client:stack:dashboard:service:grafana:environment:GF_SECURITY_ADMIN_PASSWORD
      

      Example of system response:

      mon01.stacklight-dev-2019-2-9.local:
          ----------
          docker:client:stack:dashboard:service:grafana:environment:GF_SECURITY_ADMIN_PASSWORD:
              Z06IsBRBY4h67PCczp24vERqh7fGY3dm
      
  4. Enter the URL obtained in step 2 in a web browser.

  5. Log in to the Grafana web UI using the default admin user name and the password obtained in step 3.

Once done, select the required dashboard from the Home drop-down list. For details about the available Grafana dashboards, see View Grafana dashboards. The following is an example of a Grafana dashboard:

_images/neutron-dashboard.png

View Grafana dashboards

This section describes the available Grafana dashboards.

Main dashboard

The Main dashboard displays the statuses of the services deployed on the cluster. For example, the statuses of OpenStack-related services such as Cinder, Glance, Nova, and so on. Using the Main dashboard, you can quickly determine if any of the services are down or require attention. The Main dashboard consists of the following sections:

  • The Middleware section provides the statuses of the services which are not part of OpenStack.
  • The OpenStack Control Plane section provides the statuses of OpenStack-related services.

The services may have one of the following statuses:

  • UP if the service is up and running
  • WARN if the service is down on less than half of the nodes
  • CRIT if the service is down on more than half of the nodes
  • DOWN if the service is down on all nodes
  • No value if the metric is not available
  • UNKW if the response value is higher than 1, meaning that the status of the service is unknown

Click a required service to open its dashboard with all metrics. Alternatively, click the arrow sign on a particular service to open its dashboard in a separate tab.

Kubernetes dashboards

This section describes the Kubernetes-related dashboards available in Grafana.

Dashboard Description
Calico cluster monitoring Displays the entire Calico cluster usage, including the cluster status, host status, Bird and Felix resources.
Etcd cluster Provides the cluster status and an overview of the cluster behavior (raft) and the usage of the Etcd instances.
Kubernetes cluster monitoring

Provides metrics related to the entire Kubernetes cluster, including the cluster status, host status, and the resources consumption.

Note

For the deployments with OpenContrail, the Kubernetes cluster monitoring dashboard does not include the Proxy status panel.

OpenStack dashboards

This section describes Grafana dashboards for OpenStack that you can use to explore different time-series facets of your OpenStack environment.

Dashboard Description
Cinder Provides a detailed view of the OpenStack Block Storage service metrics, such as the cluster status, host API status, API performance, Cinder services, and resources usage.
Glance Provides a detailed view of the OpenStack Image service metrics, including the overall health status of the Glance service cluster, host API status, API performance, and various indicators of the virtual resources usage, such as the number of images and snapshots, their status, size, and visibility.
Heat Provides a detailed view of the OpenStack Orchestration service, including the cluster status, host API status, and API performance metrics.
Ironic Available since 2019.2.6

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Provides a detailed overview of the OpenStack Bare Metal Provisioning service, including the cluster status, host API status, Ironic services, nodes, and drivers metrics.

Keystone Provides information about the OpenStack Identity service, including the cluster status, host API status, API performance, and the number of active and disabled Keystone users and tenants.
Neutron Provides a detailed view of the OpenStack networking service, including the cluster status, host API status, API performance, and resources consumption.
Nova dashboards
  • Nova - availability zones is available starting from the MCP 2019.2.8 maintenance update and provides the OpenStack availability zones statistics such as the availability zone and hypervisor usage, including the CPU, RAM, and disk usage.
  • Nova - overview provides an overview of the cluster status, host API status, API performance, and the status of Nova services.
  • Nova - hypervisor overview displays the CPU, RAM, and disk usage by particular hosts.
  • Nova - utilization displays the general information on CPU, RAM, and disk usage, as well as the aggregate and hypervisors usage.
  • Nova - instances:
    • Prior to the MCP 2019.2.7 maintenance update, provides the host usage information, such as the number of running instances and tasks, and the instance usage, such as the CPU, memory usage, and so on.
    • Starting from the MCP 2019.2.7 maintenance update, provides a more comprehensive information about instances, such as the CPU, RAM, disk throughput usage and allocation and allows sorting the metrics by top instances.
  • Nova - users and Nova - tenants dashboards are available starting from the MCP 2019.2.7 maintenance update and provide information about CPU, RAM, disk throughput, IOPS, and space usage and allocation and allow sorting the metrics by top users or tenants.
Octavia Provides a detailed view of the OpenStack Octavia service that is coupled with the Neutron LBaaS to display the load balancing state of your OpenStack environment. The dashboard displays the Octavia API availability and the number of resources for Octavia/LBaaS, including the load balancers, listeners, pools, members, and monitors.
OpenStack overview Provides a detailed view of your MCP OpenStack environment, including the cloud usage metrics, API errors, and allocations per aggregate.
OpenStack Tenants Removed since 2019.2.7

Note

Starting from the MCP 2019.2.7 update, the dashboard has been removed in favor of the Nova - users and Nova - tenants dashboards that are more informative.

Provides a detailed view of Nova instances usage per tenants and users, including the CPU and memory usage, disks I/O, and network RX/TX.

OpenContrail dashboards

This section describes the dashboards that provide metrics for the OpenContrail services deployed on the platform.

Dashboard Description
Cassandra Provides metrics about the Cassandra service, including the number of Cassandra endpoints, the number of native and thrift clients, information about the clients requests, the compaction engine rates, the storage metrics, as well as the heap memory and memory pool usage of the Cassandra JVM.
OpenContrail Controller

Displays the overall status of the OpenContrail APIs, the number of OpenContrail API sessions, the host status, and the number of BGP and vRouters Extensible Messaging and Presence Protocol (XMPP) sessions that are in the up and down state.

Note

For OpenContrail v3.2, the dashboard additionally includes the number of OpenContrail Discovery API servers.

OpenContrail vRouter

Displays the vRouter statistics, such as the state and total number of vRouters, service status, the number of vRouters Extensible Messaging and Presence Protocol (XMPP) and vRouters Link-Local Services (LLS) sessions, as well as the number of active and aged vRouter flows and vRouter errors.

Note

For OpenContrail v4.x, the dashboard does not include the DNS-XMPP metrics.

Zookeeper Provides a detailed view of the ZooKeeper service, including the overall status of the cluster and the statistics metrics, such as the latency, packets, ephemerals, approximate data size, alive connections, and so on.
KPI dashboards

This section describes the Key Performance Indicator (KPI) dashboards that provide an overview of the infrastructure stability.

Dashboard Description
KPI Downtime Provides an overview of the instances availability, such as average uptime of VMs, the number of available VMs, the VMs uptime percentage, instances in the ERROR state, and instances in the ACTIVE state but not responding.
KPI Provisioning Provides the percentage of instance provisioning failures based on the compute.instance.create.start, compute.instance.create.end, and compute.instance.create.error Nova notifications.
StackLight LMA components dashboards

This section describes the dashboards that provide metrics for the StackLight LMA components.

Dashboard Description
Alertmanager Provides performance metrics for the Prometheus Alertmanager service, including the overall health status of the service, the number of firing and resolved alerts received for various time periods, the rate of successful and failed notifications, and the resources consumption.
Elasticsearch Provides information about the overall health status of the Elasticsearch cluster, including the state of the shards and various metrics of resources consumption.
Grafana Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory usage.
InfluxDB Deprecated

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

Provides statistics about the InfluxDB processes running in the InfluxDB cluster including various metrics of resources consumption.

InfluxDB Relay Deprecated

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

Provides the overall health status of the InfluxDB Relay cluster and metrics for the InfluxDB Relay instances and back ends.

Kibana Provides performance metrics for the Kibana service, including the service status and resources consumption, such as the number of threads, CPU, and memory usage.
Prometheus Performances Provides metrics for the Prometheus component itself, such as the sample ingestion rate and system usage statistics per server. This enables you to check the availability and performance behavior of the Prometheus servers.
Prometheus Relay Provides metrics for the Prometheus Relay component itself, including the service status and resources consumption.
Prometheus Stats Provides statistics about the Prometheus server and includes the overall status of the Prometheus service, the uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, queries duration, and so on.
Pushgateway Provides performance metrics for the Prometheus Pushgateway service, including the overall health status of the service, the rate of samples received for various time periods, and the resources consumption.
Remote Storage Adapter Deprecated

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

Provides the overall status of the remote storage adapter service, including the number of sent, received, and ignored samples and the resources consumption.

Ceph dashboards

The Ceph dashboards provide a detailed view of the Ceph cluster, hosts, OSDs, RADOS Gateway instances, and pools.

Dashboard Description
Ceph Cluster Provides metrics related to the Ceph cluster, including the overall health status of the Ceph service cluster, capacity, latency, and recovery metrics.
Ceph Hosts Overview Provides an overview of the host-related metrics, such as the number of monitors, OSD hosts, average usage of resources across the cluster, network and hosts load.
Ceph OSD device details Provides metrics related to the Ceph OSD and physical device performance.
Ceph OSD Overview

Provides metrics related to Ceph OSDs, including the OSD read and write latencies and the distribution of PGs per OSD.

Note

Starting from the MCP 2019.2.5 maintenance update, the Distribution of PGs per OSD panel displays the data in bars instead of lines.

Ceph Pools Overview Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.
Ceph RGW Instance Details Provides detailed graphs on the RADOS Gateway host.
Ceph RGW Overview Provides metrics related to the RADOS Gateway instances, including the latencies, requests, and bandwidth.
Ceph RBD Overview Available since 2019.2.10. TechPreview.

Provides metrics related to RADOS Block Device images, including the throughput, latencies, and IOPS.

Note

Optional, disabled by default. Available as technical preview starting from the MCP 2019.2.10 maintenance update and requires Ceph Nautilus. For details, see Enable RBD monitoring.

Operating system dashboards

This section describes the dashboards that provide a detailed view of the operating system.

Dashboard Description
Bond Provides the status of Linux bond interfaces, such as the count of bond slave failures, bond status, and bond slave status.
NTP Provides metrics about the Network Time Protocol (NTP) services on hosts.
System disk I/O Provides a detailed overview of the file system metrics, including the used and free space and inodes, and series metrics, including the latency, I/O wait, and so on.
System networking Displays the network-related metrics, such as the throughput, packets, errors, dropped packets on a selected interface, and information about network data processing.
System overview Provides a detailed overview of the operating system metrics, including the overall status of the system, memory available, swap and CPU usage, and so on.
Support services dashboards

This section describes the dashboards for support services, such as RabbitMQ, HAProxy, MySQL, and so on.

Dashboard Description
Apache Displays the overall status of the Apache cluster and provides performance metrics for the Apache service, such as the number of requests, bytes transmitted, the number of connections, workers states, and current workers in the idle state.
Docker Provides the Docker cluster status, the status of Docker containers, metrics about Docker images, as well as the performance metrics of the Docker host, such as the number of threads, CPU, and disk I/O.
GlusterFS Provides the operational status of the GlusterFS cluster and service, as well as the usage reporting of the shared volumes.
HAProxy Provides the overall status of the HAProxy cluster and various metrics related to the front-end and back-end servers.
Jenkins Provides general information about the Jenkins service, including the number of online and total Jenkins nodes, the queue size, JVM free memory and uptime, as well as various metrics on executors, jobs, and resources usage.
Keepalived Provides metrics about the Keepalived service, such as the current status of the Keepalived instances, status for a specific period of time, process responsiveness, and the state of the Virtual Router Redundancy Protocol (VRRP) of Keepalived.
Memcached Provides the overall status of the Memcached service, including the metrics on the Memcached servers, memory usage, operations and network metrics.
MySQL Provides detailed information about the MySQL cluster status and service usage, including the cluster size, the average size of the receive and send queries, network I/O, locks, threads, queries, and so on.
Nginx Provides metrics about the NGINX service, such as the overall status of the NGINX cluster and information about NGINX requests and connections.
RabbitMQ

Provides general information about the RabbitMQ cluster, including the host status, cluster statistics, and resources consumption.

Note

Starting from the MCP 2019.2.3 maintenance update, the Cluster stats section displays the Queued messages and Message rates panels instead of Messages and Cluster stats.

Hide nodes from dashboards

When you remove a node from the environment, it is still displayed in the host drop-down lists of the Grafana dashboards.

To hide a node from the list:

  1. Connect to Grafana.

  2. Navigate to the required dashboard.

  3. Click the gear icon at the top left corner and select Templating.

  4. In the Variables tab, click Edit.

  5. Edit the Regex text box in the Query Options section. For example:

    • To hide cfg01, add the following text:

      ^(?!cfg01$).+$
      
    • To hide more than one node, add more conditions:

      ^(?!cfg01$|cmp01$).+$
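
    These patterns use a negative lookahead: they match any host name except the listed ones. For example, to hide all nodes that share a common prefix, such as cmp (an assumption about your node naming), you can use a prefix-based pattern:

      ^(?!cmp).+$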
      
  6. Click Update to apply the changes.

Example:

_images/remove-node-templating.png

Add the Gnocchi data source to Grafana

By default, Grafana uses Prometheus as a data source to provide graphs and charts. If your OpenStack version is Pike or newer and you have deployed Tenant Telemetry as described in MCP Deployment Guide: Deploy Tenant Telemetry, you can also add Gnocchi as the data source for Grafana. This allows StackLight LMA to gather data both from Prometheus and from Gnocchi, the Tenant Telemetry database, for further processing and display through Grafana dashboards.

Prerequisites

Before you add Gnocchi as a data source to Grafana, verify that your environment meets the following requirements.

  • Due to a limitation in Grafana, Gnocchi and Keystone must be accessible through the same HTTP vhost.
  • Keystone must return public endpoints for Gnocchi and for its own services.
  • The public endpoint must be resolvable from Grafana and your browser.
  • Cross-Origin Resource Sharing (CORS) must be enabled in Gnocchi and Keystone to allow requests from Grafana. Additionally, CORS must be enabled in your browser.

To satisfy the requirements, complete the following prerequisite steps:

  1. Enable CORS for the Gnocchi and Keystone servers:

    1. Open your project Git repository with the Reclass model on the cluster level.

    2. In the openstack/telemetry.yml file, add the following configuration for Gnocchi, specifying the vhost FQDN as allowed_origin:

      gnocchi:
        server:
          cors:
            allowed_origin: "http://example.com"
      
    3. In the openstack/control.yml file, add the following configuration for Keystone, specifying the vhost FQDN as allowed_origin:

      keystone:
        server:
          cors:
            allowed_origin: "http://example.com"
      
  2. Set the endpoints for Keystone and Gnocchi:

    1. In the openstack/init.yml file, specify the following parameters:

      _param:
          gnocchi_public_host: example.com
          gnocchi_public_port: 80
          gnocchi_public_path: '/metric'
          keystone_public_path: '/identity'
          keystone_public_address: example.com
          keystone_public_port: 80
      
  3. Log in to the Salt Master node.

  4. Apply the following states:

    salt 'ctl*' state.sls keystone
    salt -C 'I@keystone:client' state.sls keystone.client
    salt 'mdb*' state.sls gnocchi
    
  5. Configure the same HTTP host for Gnocchi and Keystone:

    1. Create a new file /etc/nginx/sites-enabled/nginx_proxy_gnocchi.conf for the vhost configuration for Gnocchi and specify the following parameters:

      server {
          listen 80 default_server;
          server_name _;

          # Proxy requests under /metric/ to the Gnocchi API
          # (172.16.10.250:8041 is an example Gnocchi API endpoint; adjust to your deployment)
          location ~ ^/metric/?(.*) {
              proxy_pass      http://172.16.10.250:8041/$1;

              proxy_pass_request_body on;
              proxy_set_header        Host            $host;
              proxy_set_header        X-Real-IP       $remote_addr;
              proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
          }

          # Proxy requests under /identity/ to the Keystone public API
          # (172.16.10.254:5000 is an example Keystone endpoint; adjust to your deployment)
          location ~ ^/identity/?(.*) {
              proxy_pass http://172.16.10.254:5000/$1;

              proxy_pass_request_body on;
              proxy_set_header        Host            $host;
              proxy_set_header        X-Real-IP       $remote_addr;
              proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
          }
      }
      
    2. Verify the new configuration and restart NGINX:

      salt -C 'I@nginx:server' cmd.run 'nginx -t && nginx -s reload'
      
  6. Make the HTTP host resolvable from the browser through an internal DNS or a static configuration on hosts, for example, /etc/hosts in Linux.

    For a static configuration, run the following command in your console to add a record in /etc/hosts, where example.com is the vhost FQDN:

    echo "<grafana_public_ip_address> example.com" >> /etc/hosts
    
  7. Enable CORS requests in your browser or install a corresponding browser extension. For example, use the Chrome extension or Firefox extension.

Once done, proceed to Add the Gnocchi data source.

Add the Gnocchi data source

Once you perform the steps described in Prerequisites, perform the steps below to add the Gnocchi data source to Grafana.

To add the Gnocchi data source:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In the stacklight/client.yml file, add the following class:

    - system.grafana.client.datasource.gnocchi
    
  3. In the stacklight/server.yml file, specify the following parameters, where example.com is the vhost FQDN:

    docker_image_grafana: docker-prod-local.artifactory.mirantis.com/openstack-docker/grafana:stable
    grafana_gnocchi_address: example.com
    
  4. Log in to the Salt Master node.

  5. Apply the following states:

    salt 'mon*' state.sls docker.client
    salt 'mon*' state.sls grafana.client
    
  6. If the HTTP vhost name is not resolvable from the Docker container, run the following command from the host that runs the container to add a record to /etc/hosts of the Grafana container:

    GRAFANA_CONTAINER_ID=`docker ps | grep grafana | cut -d" " -f1`
    HOST_IP=`hostname -I | cut -d " " -f1`
    docker exec -u root -it $GRAFANA_CONTAINER_ID bash -c "echo \"$HOST_IP example.com\" >> /etc/hosts"
    
  7. In the Grafana web UI, navigate to Configuration > Data Sources.

  8. Open the Gnocchi data source and click Save & Test.

Use Kibana

Kibana is used for log and time series analytics. Kibana provides real time visualization of the data stored in Elasticsearch and allows you to diagnose issues. This section describes how to connect to Kibana and use its dashboards.

Connect to Kibana

To access Kibana, use the public VIP exposed by the proxy node, the default 5601 port, and the HTTPS protocol.

To connect to Kibana:

  1. Log in to the Salt Master node.

  2. Obtain the Kibana hostname (IP address), port, and protocol to use as a URL:

    salt -C 'I@horizon:server' pillar.item nginx:server:site:nginx_proxy_kibana:host
    

    Example of system response:

    prx01.stacklight-dev-2019-2-9.local:
    ----------
    nginx:server:site:nginx_proxy_kibana:host:
        ----------
        name:
            10.13.0.80
        port:
            5601
        protocol:
            https
    
  3. Enter the URL obtained in step 2 in a web browser.

No credentials are required to connect to Kibana.

Manage Kibana dashboards

Kibana contains the following built-in dashboards:

  • The Logs analytics dashboard that is used to visualize and search the logs.
  • The Notifications analytics dashboard that is used to visualize and search the notifications. This dashboard is available if you enable the feature in the Collector settings.
  • The Audit analytics dashboard that is used to visualize and search for the OpenStack CADF notifications.

To switch from one dashboard to another, click Dashboard and select the required one as shown in the screen capture below:

_images/kibana-service-dashboard.png

Each dashboard provides a single pane of glass for visualizing and searching for the logs and notifications of your deployment.

The Kibana dashboard for logs is divided into several sections:

  1. A time-picker control to select the required time period and refresh frequency.

  2. A text box to enter search queries.

  3. The logs analytics with six different panels showing the following stack graphs:

    1. All logs per source
    2. All logs per severity
    3. All logs for top 10 sources
    4. All logs for top 10 programs
    5. All logs for top 10 hosts
    6. The number of logs per severity

    The table of log messages is sorted in reverse chronological order.

Use Kibana filters and queries

Filters and queries have similar syntax but are used for different purposes:

  • Filters are used to restrict what is displayed in the Kibana dashboard.
  • Queries are used for free-text search.

You can combine multiple queries and compare the results. You can also further filter the log messages. For example, to select the Hostname filter:

  1. Expand a log entry.

  2. Select the Hostname field by clicking on the magnifying glass icon as follows:

    _images/kibana-logs-filter1.png

    This will apply a new filter in the Kibana dashboard:

    _images/kibana-logs-filter2.png

Filtering works for any field that has been indexed for the log entries that are present in the Kibana dashboard.

Filters and queries can also use wildcards combined with field names, as in Logger.keyword:<name>. For example, to display only the Nova logs, enter Logger.keyword:openstack.nova in the query text box as follows:

_images/kibana-logs-query.png
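
Queries can also be combined with the AND and OR operators. For example, assuming the field names shown above and that the Hostname field has a keyword sub-field in your Elasticsearch index mapping in the same way as Logger, the following query displays only the Nova log messages from hosts whose names start with ctl (an arbitrary example prefix):

  Logger.keyword:openstack.nova AND Hostname.keyword:ctl*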

StackLight LMA alerts

This section provides a detailed overview of the available StackLight LMA alerts including their customization capabilities and troubleshooting recommendations, as well as describes the alerts that require post-deployment configuration. This section also provides an instruction on how to generate the list of alerts for a particular MCP deployment.

Available StackLight LMA alerts

This section describes the available StackLight LMA alerts grouped by services.

Core services

This section describes the alerts available for the core services.

Apache

This section describes the alerts for the Apache service.


ApacheServiceDown
Severity Minor
Summary The Apache service on the {{ $labels.host }} node is down.
Raise condition apache_up != 1
Description Raises when the Apache service on a particular host does not respond to the Telegraf service, typically meaning that the Apache service is down or misconfigured on that host. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Apache service status by running systemctl status apache2 on the affected node.
  • Inspect the Telegraf logs by running journalctl -u telegraf on the affected node.
Tuning Not required
ApacheServiceOutage
Severity Critical
Summary All Apache services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(apache_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(apache_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when all Apache services across the cluster do not respond to the Telegraf plugin, typically indicating some global deployment or configuration issues. The cluster label in the raised alert is a set of nodes with the same host name prefix, for example, mon, cid.
Troubleshooting
  • Verify the Apache service status by running systemctl status apache2 on the affected node.
  • Inspect the Telegraf logs by running journalctl -u telegraf on the affected node.
Tuning Not required
ApacheWorkersAbsent
Severity Minor
Summary The Apache service on the {{ $labels.host }} node has no available workers for 2 minutes.
Raise condition apache_IdleWorkers == 0
Description Raises when the Apache service on a particular host has no free idle workers.
Troubleshooting Increase the MaxClients Apache parameter value using the apache.server.mpm.servers.max_requests pillar, as sketched after this alert description.
Tuning Not required
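
For reference, a minimal sketch of such a pillar override on the cluster level, translating the apache.server.mpm.servers.max_requests path mentioned above into YAML (the value 512 is an arbitrary example):

  parameters:
    apache:
      server:
        mpm:
          servers:
            max_requests: 512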
Bond interfaces

This section describes the alerts for bond interfaces.


BondInterfaceDown
Severity Critical
Summary The {{ $labels.bond }} bond interface on the {{ $labels.host }} node has all ifaces down.
Raise condition bond_status < 1
Description Raises when the bond network interface and all slave interfaces are down, typically indicating network interface misconfiguration or issues with slave interfaces. The host and bond labels in the raised alert contain the host name of the affected node and the affected interface. For details on the bond interface, see the /proc/net/bonding/{bond_name} file on the affected node.
Tuning Not required
BondInterfaceSlaveDown
Severity Warning
Summary The {{ $labels.bond }} bond interface slave {{ $labels.interface }} on the {{ $labels.host }} node is down.
Raise condition bond_slave_status < 1
Description Raises when the slave network interface of a bond interface is down, typically indicating a network misconfiguration. The host, bond, and interface labels in the raised alert contain the host name of the affected node, bond interface name, and the name of the interface slave. For details on the bond interface, see the /proc/net/bonding/{bond_name} file on the affected node.
Tuning Not required
BondInterfaceSlaveDownMajor
Severity Major
Summary More than 50% of {{ $labels.bond }} bond interface slaves on the {{ $labels.host }} node are down.
Raise condition sum(bond_slave_status) by (bond,host) <= 0.5 * count(bond_slave_status) by (bond,host)
Description Raises when more than 50% of slave network interfaces of a bond interface are down, typically indicating a network misconfiguration. The host, bond, and interface labels in the raised alert contain the host name of the affected node, bond interface name, and the name of the interface slave. For details on the bond interface, see the /proc/net/bonding/{bond_name} file on the affected node.
Tuning Not required
BondInterfaceSingleSlave

Available starting from the 2019.2.4 maintenance update

Severity Major
Summary The {{ $labels.bond }} bond interface on the {{ $labels.host }} node has only one slave.
Raise condition count(bond_slave_status) by (bond,host) == 1
Description Raises when the bond interface has only one slave, typically indicating a network misconfiguration since the bond interface must have at least two slave interfaces. The host, bond, and interface labels in the raised alert contain the host name of the affected node, bond interface name, and the name of the interface slave. For details on the bond interface, see the /proc/net/bonding/{bond_name} file on the affected node.
Tuning Not required
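
To inspect a bond interface reported by any of the alerts above, view its status file on the affected node, substituting the bond name from the bond label of the alert (bond0 is a placeholder):

  cat /proc/net/bonding/bond0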
Docker

This section describes the alerts for the Docker service.


DockerdProcessDown
Severity Minor
Summary The dockerd process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="dockerd"} == 0
Description Raises when Telegraf cannot find running dockerd processes on a host. The host label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify Docker status by running systemctl status docker on the affected node.
  • Inspect Docker logs using journalctl -u docker.
Tuning Not required
DockerServiceOutage
Severity Critical
Summary All dockerd processes within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(procstat_running{process_name="dockerd"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="dockerd"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when Telegraf cannot find running dockerd processes on all hosts of a cluster. The cluster label in the raised alert is a set of nodes with the same host name prefix, for example, mon or cid.
Troubleshooting
  • Inspect the DockerdProcessDown alerts for the host names of the affected nodes.
  • Verify the Docker service status using service docker status.
  • Inspect Docker logs using journalctl -u docker.
Tuning Not required
DockerService {{ camel_case_name }} ReplicasDownMinor
Severity Minor
Summary More than 30% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.
Raise condition
  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_warning_threshold_percent }}
  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_warning_threshold_percent }}
Description

A generated set of alerts for each Docker Swarm service. Applicable only for the replicated Docker Swarm services.

Raises when the cluster has more than 30% of unavailable replicas. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.
Tuning Not required
DockerService {{ camel_case_name }} ReplicasDownMajor
Severity Major
Summary More than 60% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.
Raise condition
  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_critical_threshold_percent }}
  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_critical_threshold_percent }}
Description

A generated set of alerts for each Docker Swarm service. Applicable only for the replicated Docker Swarm services.

Raises when the cluster has more than 60% of unavailable service replicas. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.
Tuning Not required
DockerService {{ camel_case_name }} Outage
Severity Critical
Summary All Docker Swarm {{ full_service_name }} replicas are down for 2 minutes.
Raise condition
  • In 2019.2.5 and prior: docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent (docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1
  • In 2019.2.6-2019.2.8: docker_swarm_tasks_desired{{ '{' + label_selector + '}' }} > 0 and (docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent (docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
  • In 2019.2.9 and newer: min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) > 0 and (min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 0 or absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
Description

A generated set of alerts for each Docker Swarm service. Applicable only for the replicated Docker Swarm services.

Raises when the cluster has no available service replicas. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.
Tuning Not required
DockerdServiceReplicaFlapping

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The Docker Swarm {{ $labels.service_name }} service replica is flapping for 15 minutes.
Raise condition sum(changes(docker_swarm_tasks_running[10m])) by (service_name) > 0
Description Raises when the container with the service cannot start properly in the Docker Swarm cluster and stops after the start, meaning that the service may be unavailable. However, the service unavailability alert may not fire because the container is being constantly restarted.
Troubleshooting Inspect the failed container logs using docker logs <container_id>.
Tuning Not required
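
The Docker troubleshooting steps above rely on systemctl, journalctl, and the docker CLI. A minimal check sequence on an affected node; my_service and <container_id> are placeholders:

  # Verify that the Docker daemon is running and inspect its recent logs
  systemctl status docker
  journalctl -u docker -n 100

  # List Swarm services with their running/desired replica counts
  docker service ls

  # Show the tasks (replicas) of one service; my_service is a placeholder
  docker service ps my_service

  # Inspect a failed container; obtain <container_id> from "docker ps -a"
  docker logs <container_id>
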
Galera

This section describes the alerts for the Galera cluster.


GaleraServiceDown
Severity Minor
Summary The Galera service on the {{ $labels.host }} node is down.
Raise condition mysql_up != 1
Description Raises when MySQL on a host does not respond to Telegraf, typically indicating that MySQL is not running on that node. The host label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the MySQL status on the affected node using service mysql status.
  • If MySQL is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
GaleraServiceOutage
Severity Critical
Summary All Galera services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(mysql_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(mysql_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when all MySQL services across the cluster do not respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting
  • Verify the MySQL status on any Galera node using service mysql status.
  • If MySQL is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
GaleraNodeNotReady
Severity Major
Summary The Galera service on the {{ $labels.host }} node is not ready to serve queries for 1 minute.
Raise condition mysql_wsrep_ready != 1
Description Raises when the Write Set Replication (WSREP) in the MySQL service is not ready, typically indicating that the MySQL process is running but the WSREP is not in the ready state, meaning that the node is not a part of the Galera cluster.
Troubleshooting Inspect the MySQL logs on the affected node using journalctl -u mysql.
Tuning Not required
GaleraNodeNotConnected
Severity Major
Summary The Galera service on the {{ $labels.host }} node is not connected to the cluster for 1 minute.
Raise condition mysql_wsrep_connected != 1
Description Raises when the Write Set Replication (WSREP) in the MySQL service is not in the connected state, typically indicating that the MySQL process is running but the WSREP did not establish the required connections with other nodes within the Galera cluster due to the WSREP misconfiguration in MySQL or a network issue.
Troubleshooting
  • Inspect the MySQL logs on the affected node using journalctl -u mysql.
  • Verify if proper hosts are used in Galera.
Tuning Not required
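
The WSREP state reported by GaleraNodeNotReady and GaleraNodeNotConnected can be verified directly from MySQL on the affected dbs node. A minimal check, assuming MySQL client credentials are available (for example, through a root .my.cnf file):

  # Confirm that the MySQL process is running
  service mysql status

  # wsrep_ready and wsrep_connected must be ON, and wsrep_cluster_size
  # must match the number of Galera nodes in the cluster
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';" | grep -E 'wsrep_ready|wsrep_connected|wsrep_cluster_size'
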
GlusterFS

This section describes the alerts for the GlusterFS service.


GlusterfsServiceMinor
Severity Minor
Summary The GlusterFS service on the {{ $labels.host }} host is down.
Raise condition procstat_running{process_name="glusterd"} < 1
Description Raises when Telegraf cannot find running glusterd processes on the kvm hosts.
Troubleshooting
  • Verify the GlusterFS status using systemctl status glusterfs-server.
  • Inspect GlusterFS logs in the /var/log/glusterfs/ directory.
Tuning Not required
GlusterfsServiceOutage
Severity Critical
Summary All GlusterFS services are down.
Raise condition glusterfs_up != 1
Description Raises when the Telegraf service cannot connect to or gather metrics from the GlusterFS service, typically indicating issues with GlusterFS, the Telegraf monitoring_remote_agent service, or the network.
Troubleshooting
  • Inspect the Telegraf monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on any mon node.
  • Verify the GlusterFS status using systemctl status glusterfs-server.
  • Inspect GlusterFS logs in the /var/log/glusterfs/ directory.
Tuning Not required
GlusterfsInodesUsedMinor
Severity Minor
Summary More than 80% of GlusterFS {{ $labels.volume }} volume inodes are used for 2 minutes.
Raise condition glusterfs_inodes_percent_used >= {{ monitoring.inodes_percent_used_minor_threshold_percent*100 }} and glusterfs_inodes_percent_used < {{ monitoring.inodes_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 80% and less than 90% of available inodes. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Verify the number of objects stored in GlusterFS.
  • If possible, increase the number of inodes.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available inodes on the GlusterFS nodes and adjust the threshold according to the number of available inodes. In the Prometheus Web UI, use the raise condition query for a longer period of time to define the best threshold.

For example, to change the threshold to the 90 - 95% interval:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsInodesUsedMinor:
              if: >-
                glusterfs_inodes_percent_used >= 90 and
                glusterfs_inodes_percent_used < 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

GlusterfsInodesUsedMajor
Severity Major
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume inodes are used for 2 minutes.
Raise condition glusterfs_inodes_percent_used >= {{ monitoring.inodes_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 90% of available inodes. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Verify the number of objects stored in GlusterFS.
  • If possible, increase the number of inodes.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available inodes on the GlusterFS nodes and adjust the threshold according to the number of available inodes. In the Prometheus Web UI, use the raise condition query for a longer period of time to define the best threshold.

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsInodesUsedMajor:
              if: >-
                glusterfs_inodes_percent_used >= 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

GlusterfsSpaceUsedMinor
Severity Minor
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume disk space is used for 2 minutes.
Raise condition glusterfs_space_percent_used >= {{ monitoring.space_percent_used_minor_threshold_percent*100 }} and glusterfs_space_percent_used < {{ monitoring.space_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 80% and less than 90% of available space. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Inspect the data stored in GlusterFS.
  • Increase the GlusterFS capacity.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available space on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus Web UI, use the raise condition query for a longer period of time to define the best threshold.

For example, to change the threshold to the 90-95% interval:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsSpaceUsedMinor:
              if: >-
                glusterfs_space_percent_used >= 90 and
                glusterfs_space_percent_used < 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

GlusterfsSpaceUsedMajor
Severity Major
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume disk space is used for 2 minutes.
Raise condition glusterfs_space_percent_used >= {{ monitoring.space_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 90% of available space. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Inspect the data stored in GlusterFS.
  • Increase the GlusterFS capacity.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available space on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus Web UI, use the raise condition query for a longer period of time to define the best threshold.

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsSpaceUsedMajor:
              if: >-
                glusterfs_space_percent_used >= 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.


GlusterfsMountMissing

Available since the 2019.2.11 maintenance update

Severity Major
Summary GlusterFS mount point is not mounted.
Raise condition delta(glusterfs_mount_scrapes:rate5m{fstype=~"(fuse.)?glusterfs"}[5m]) < 0 or glusterfs_mount_scrapes:rate5m{fstype=~"(fuse.)?glusterfs"} == 0
Description Raises when a GlusterFS mount point is not mounted. The path, host, and device labels in the raised alert contain the path, node name, and device of the affected mount point.
Troubleshooting To perform a manual remount, apply the salt '*' state.sls glusterfs.client Salt state from the Salt Master node.
Tuning Not required
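
The GlusterFS capacity alerts above track per-volume inode and space usage. A minimal check on a node running glusterd; the mount point below is a placeholder:

  # Verify that the GlusterFS daemon is running
  systemctl status glusterfs-server

  # List the volumes and the status of their bricks
  gluster volume info
  gluster volume status

  # Inode and space usage of a mounted volume; /mnt/glusterfs_volume is a placeholder
  df -i /mnt/glusterfs_volume
  df -h /mnt/glusterfs_volume
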
HAProxy

This section describes the alerts for the HAProxy service.


HaproxyServiceDown
Severity Minor
Summary The HAProxy service on the {{ $labels.host }} node is down.
Raise condition haproxy_up != 1
Description Raises when the HAProxy service on a node does not respond to Telegraf, typically meaning that the HAProxy process is in the DOWN state on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the HAProxy status by running systemctl status haproxy on the affected node.
  • If HAProxy is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
HaproxyServiceDownMajor
Severity Major
Summary More than 50% of HAProxy services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+") != 1) by (cluster) >= 0.5 * count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when the HAProxy service does not respond to Telegraf on more than 50% of cluster nodes. The cluster label in the raised alert contains the cluster prefix, for example, ctl, dbs, or mon.
Troubleshooting
  • Inspect the HaproxyServiceDown alerts for the host names of the affected nodes.
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/haproxy.log.
  • Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
HaproxyServiceOutage
Severity Critical
Summary All HAProxy services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+") != 1) by (cluster) == count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when the HAProxy service does not respond to Telegraf on all nodes of a cluster, typically indicating deployment or configuration issues. The cluster label in the raised alert contains the cluster prefix, for example, ctl, dbs, or mon.
Troubleshooting
  • Inspect the HaproxyServiceDown alerts for the host names of the affected nodes.
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/haproxy.log.
  • Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
HaproxyHTTPResponse5xxTooHigh
Severity Warning
Summary The average per-second rate of 5xx HTTP errors on the {{ $labels.host }} node for the {{ $labels.proxy }} back end is {{ $value }} (as measured over the last 2 minutes).
Raise condition rate(haproxy_http_response_5xx{sv="FRONTEND"}[2m]) > 1
Description Raises when the HTTP 5xx responses sent by HAProxy increased for the last 2 minutes, indicating a configuration issue with the HAProxy service or back-end servers within the cluster. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the HAProxy logs by running journalctl -u haproxy on the affected node and verify the state of the back-end servers.
Tuning Not required
HaproxyBackendDown
Severity Minor
Summary The {{ $labels.proxy }} back end on the {{ $labels.host }} node is down.
Raise condition increase(haproxy_chkdown{sv="BACKEND"}[1m]) > 0
Description Raises when an internal HAProxy check for the back-end availability reported the back-end outage. The host and proxy labels in the raised alert contain the host name of the affected node and the service proxy name.
Troubleshooting
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Verify the state of the affected back-end server:
    • Verify that the server is responding and the back-end service is active and responsive.
    • Verify the state of the back-end service using an HTTP GET request, for example, curl -XGET http://ctl01:8888/. Typically, the 200 response code indicates the healthy state.
Tuning Not required
HaproxyBackendDownMajor
Severity Major
Summary More than 50% of {{ $labels.proxy }} back ends are down.
Raise condition
  • In 2019.2.10 and prior: 0.5 * avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy) >= avg(sum(haproxy_active_servers{type="backend"}) by (host, proxy) + sum(haproxy_backup_servers{type="backend"}) by (host, proxy)) by (proxy)
  • In 2019.2.11 and newer: avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy) - avg(sum(haproxy_active_servers{type="backend"}) by (host, proxy) + sum(haproxy_backup_servers{type="backend"}) by (host, proxy)) by (proxy) >= 0.5 * avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy)
Description Raises when at least half of the back-end servers (>=50%) used by the HAProxy service are in the DOWN state. The host and proxy labels in the raised alert contain the host name of the affected node and the service proxy name.
Troubleshooting
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Verify the state of the affected back-end server:
    • Verify that the server is responding and the back-end service is active and responsive.
    • Verify the state of the back-end service using an HTTP GET request, for example, curl -XGET http://ctl01:8888/. Typically, the 200 response code indicates the healthy state.
Tuning Not required
HaproxyBackendOutage
Severity Critical
Summary All {{ $labels.proxy }} back ends are down.
Raise condition max(haproxy_active_servers{sv="BACKEND"}) by (proxy) + max(haproxy_backup_servers{sv="BACKEND"}) by (proxy) == 0
Description Raises when all back-end servers used by the HAProxy service across the cluster are not available to process the requests proxied by HAProxy, typically indicating deployment or configuration issues. The proxy label in the raised alert contains the service proxy name.
Troubleshooting
  • Verify the affected back ends.
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Inspect Telegraf logs by running journalctl -u telegraf on the affected node.
Tuning Not required
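
In addition to the journalctl and curl checks above, the per-back-end status that HAProxy itself reports can be dumped from the HAProxy stats socket, if it is enabled in your configuration. The socket path below is an assumption and depends on the haproxy.cfg of your deployment:

  # Verify the HAProxy service and inspect its recent logs
  systemctl status haproxy
  journalctl -u haproxy -n 100

  # Dump the statistics; the status column shows UP/DOWN for each back-end server
  # (the socket path is deployment-specific)
  echo "show stat" | socat stdio UNIX-CONNECT:/run/haproxy/admin.sock
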
Keepalived

This section describes the alerts for the Keepalived service.

KeepalivedProcessDown
Severity Major
Summary The Keepalived process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="keepalived"} == 0
Description Raises when Keepalived on a particular host does not respond to Telegraf, typically indicating that Keepalived is down. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Keepalived status on the affected node using systemctl status keepalived.
  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
KeepalivedProcessNotResponsive
Severity Major
Summary The Keepalived process on the {{ $labels.host }} node is not responding.
Raise condition keepalived_up == 0
Description Raises when Keepalived on a particular host does not respond to Telegraf, typically indicating that Keepalived is running but is not responsive on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Keepalived status on the affected node using service keepalived status.
  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
KeepalivedFailedState
Severity Minor
Summary The Keepalived VRRP {{ $labels.name }} is in the FAILED state on the {{ $labels.host }} node.
Raise condition keepalived_state == 0
Description Raises when the Keepalived Virtual Router Redundancy Protocol (VRRP) is in the FAILED state on a node, typically indicating network issues. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
  • Inspect the affected node for any network issues.
Tuning Not required
KeepalivedUnknownState
Severity Minor
Summary The Keepalived VRRP {{ $labels.name }} is in the UNKNOWN state on the {{ $labels.host }} node.
Raise condition keepalived_state == -1
Description Raises when the Keepalived Virtual Router Redundancy Protocol (VRRP) is in the UNKNOWN state on a node, typically indicating that Keepalived has improperly reported its state or Telegraf cannot gather the state. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
KeepalivedMultipleIPAddr
Severity Major
Summary The Keepalived {{ $labels.ip }} virtual IP is assigned more than once.
Raise condition count(ipcheck_assigned) by (ip) > 1
Description Raises when the virtual IP address (VIP) of Keepalived is assigned more than once (on more than one node within a cluster).
Troubleshooting On each node of the Keepalived cluster, ctl nodes by default, verify if the VIP is assigned on two or more nodes or interfaces using the ip a | grep VIP_address command.
Tuning Not required
KeepalivedServiceOutage
Severity Critical
Summary All Keepalived processes within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(procstat_running{process_name="keepalived"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="keepalived"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when all Keepalived services across the cluster do not respond to Telegraf, typically indicating configuration or deployment issues.
Troubleshooting
  • Inspect the KeepalivedProcessDown alerts for the host names of the affected nodes.
  • Inspect the Keepalived logs on the affected nodes using journalctl -u keepalived.
  • Inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.
Tuning Not required
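
The KeepalivedMultipleIPAddr check can be run across all nodes of the Keepalived cluster at once from the Salt Master node. For example, where 10.0.0.10 stands for your VIP address and ctl* for your Keepalived cluster nodes:

  # The VIP must be assigned on exactly one node
  salt 'ctl*' cmd.run 'ip addr show | grep 10.0.0.10'

  # Verify the Keepalived service state on the same nodes
  salt 'ctl*' cmd.run 'systemctl status keepalived --no-pager -l'
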
libvirt

This section describes the alerts for the libvirt service.

LibvirtDown
Severity Critical
Summary The libvirt metric exporter fails to gather metrics on the {{ $labels.host }} node for 2 minutes.
Raise condition libvirt_up == 0
Description Raises when libvirt_exporter fails to gather metrics for 2 minutes. The host label in the raised alert contains the hostname of the affected node.
Troubleshooting
  • Verify the libvirt-exporter service status using systemctl status libvirt-exporter.
  • Inspect the libvirt-exporter service logs using journalctl -u libvirt-exporter or in /var/log/libvirt-exporter.
  • Inspect the libvirt logs using journalctl -u libvirt or in /var/log/libvirt.
Tuning Not required
Memcached

This section describes the alerts for the Memcached service.


MemcachedServiceDown
Severity Minor
Summary The Memcached service on the {{ $labels.host }} node is down.
Raise condition memcached_up == 0
Description Raises when Telegraf cannot gather metrics from the Memcached service, typically indicating that Memcached is down on one node and caching does not work on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Memcached service status using systemctl status memcached.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required
MemcachedServiceRespawn

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary The Memcached service on the {{ $labels.host }} node was respawned.
Raise condition memcached_uptime < 180
Description

Raises when the Memcached service uptime is below 180 seconds, indicating that it was recently respawned (restarted). If Memcached respawning happened during maintenance, the alert is expected. Otherwise, this alert indicates an issue with the service. The host label in the raised alert contains the host name of the affected node.

Warning

The alert is a partial duplicate of MemcachedServiceDown and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, verify and disable this alert.

Troubleshooting
  • Verify the Memcached service status using systemctl status memcached.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Disable the alert as described in Manage alerts.
MemcachedConnectionThrottled
Severity Warning
Summary More than 5 client connections to the Memcached database on the {{ $labels.host }} node throttle for 2 minutes.
Raise condition increase(memcached_conn_yields[1m]) > 5
Description Raises when the number of times the Memcached connection was throttled exceeds 5 over the last minute. This warning appears together with the Too many open connections error message in Memcached. Too many open connections may cause write errors due to process starvation (blocking). To avoid this, Memcached throttles the connections. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
  • Adjust the threshold if required.
Tuning

To change the throttling threshold to 10:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedConnectionThrottled:
              if: >-
                increase(memcached_conn_yields[1m]) > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

MemcachedConnectionsNoneMinor
Severity Minor
Summary The Memcached database on the {{ $labels.host }} node has no open connections.
Raise condition memcached_curr_connections == 0
Description Raises when no connections to Memcached exist on one node, typically indicating that the connections were dropped. The state may affect performance. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required
MemcachedConnectionsNoneMajor
Severity Major
Summary The Memcached database has no open connections on all nodes.
Raise condition count(memcached_curr_connections == 0) == count(memcached_up)
Description Raises when no connections to Memcached exist on all nodes, indicating that Memcached has no client connected to it and does not receive data.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required
MemcachedItemsNoneMinor

Removed since the 2019.2.4 maintenance update.

Severity Minor
Summary The Memcached database on the {{ $labels.host }} node is empty.
Raise condition memcached_curr_items == 0
Description

Raises when a Memcached database has no items on one node. As Memcached is an in-memory database, this may be the result of Memcached respawn. Otherwise, investigate the reason. The host label in the raised alert contains the host name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting
  1. To confirm the issue, use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.
  2. Run stats and search for curr_items and evictions to verify that the items were not removed before their TTL.
  3. Run stats items for further details on the status of the items.
Tuning Disable the alert as described in Manage alerts.
MemcachedEvictionsLimit
Severity Warning
Summary More than 10 evictions in the Memcached database occurred on the {{ $labels.host }} node during the last minute.
Raise condition increase(memcached_evictions[1m]) > 10
Description Raises when the number of Memcached items removed before the end of their TTL has increased by more than 10 (default threshold) over the last minute. Memcached is used on the OpenStack controller nodes to cache the service authentication tokens. A high number of evictions indicates a heavy token rotation since old items must be removed to free the space for the new ones, based on pseudo-LRU. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.
  2. Run stats slabs and search for total_pages, chunk_size, and chunks_per_page to verify if the slabs consume too much space.
Tuning

To change the evictions limit to 60:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedEvictionsLimit:
              if: >-
                increase(memcached_evictions[1m]) > 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
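
Several Memcached alerts above suggest connecting with telnet and running stats. The relevant counters can also be dumped non-interactively, assuming nc is installed on the affected node:

  # curr_connections maps to MemcachedConnectionsNone*, conn_yields to
  # MemcachedConnectionThrottled, evictions and curr_items to MemcachedEvictionsLimit
  printf 'stats\r\nquit\r\n' | nc localhost 11211 | \
    grep -E 'curr_connections|conn_yields|evictions|curr_items'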

NGINX

This section describes the alerts for the NGINX service.


NginxServiceDown
Severity Minor
Summary The NGINX service on the {{ $labels.host }} node is down.
Raise condition nginx_up != 1
Description Raises when the NGINX service on a node does not respond to Telegraf for 1 minute, typically indicating that the NGINX service is not running on that node. The host label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the NGINX status on the affected node using service nginx status.
  • If NGINX is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
NginxServiceOutage
Severity Critical
Summary All NGINX processes within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(nginx_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(nginx_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when all NGINX services across a cluster do not respond to Telegraf, typically indicating deployment or configuration issues. The cluster label in the raised alert contains the prefix of a cluster, for example, ctl, dbs, or mon.
Troubleshooting Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
NginxDroppedIncomingConnections
Severity Minor
Summary NGINX drops {{ $value }} accepted connections per second for 5 minutes.
Raise condition irate(nginx_accepts[5m]) - irate(nginx_handled[5m]) > 0
Description Raises when NGINX has dropped the accepted connections for the last 5 minutes, indicating that NGINX does not handle every incoming connection, which may be caused by a resource or configuration limit. The host label contains the name of the affected node.
Troubleshooting Inspect the NGINX logs using journalctl -u nginx.
Tuning Not required
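
The nginx_accepts and nginx_handled counters behind NginxDroppedIncomingConnections come from the NGINX status page that Telegraf scrapes. A manual check, assuming the stub_status endpoint is exposed at /nginx_status on the local host (the actual location depends on your NGINX configuration):

  # Verify that NGINX is running
  systemctl status nginx

  # A growing gap between "accepts" and "handled" means connections are dropped
  curl -s http://127.0.0.1/nginx_status
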
NTP

This section describes the alerts for the NTP service.

NtpOffsetTooHigh
Severity Warning
Summary The NTP offset on the {{ $labels.host }} node is more than 200 milliseconds for 2 minutes.
Raise condition ntpq_offset >= 200
Description Raises when the NTP offset on a node reaches the threshold of 200 milliseconds for 2 minutes, typically indicating that the host fails to synchronize the time with the NTP server or the NTP server is malfunctioning. An excessively high offset affects metrics collection and querying of the time series database. The host label in the raised alert contains the host name of the affected node.
Troubleshooting

Synchronize the time with a properly operating NTP server:

  1. Enter the NTP CLI by running ntpq on the affected node.
  2. List the NTP peers by running peers and exit the NTP CLI.
  3. Verify the time offset against a selected peer using ntpdate -q <peer_from_list>.

If the issue persists:

  1. Enter the NTP CLI by running ntpq on the affected node.
  2. List the associations by running as.
  3. Investigate the reason for the server rejection by running rv <association_id> with a chosen association ID.
  4. Inspect the output for the occurrence of flash code, rootdispersion, dispersion, and jitter. Avoid syncing with servers that have a large dispersion.
Tuning

For example, to change the threshold of the NTP offset to 500:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NtpOffsetTooHigh:
              if: >-
                ntpq_offset >= 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
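
The NtpOffsetTooHigh troubleshooting steps above can also be performed with the following non-interactive commands:

  # List the NTP peers together with their offset (in milliseconds) and jitter
  ntpq -p

  # Query a specific peer without setting the clock; <peer_from_list> is a placeholder
  ntpdate -q <peer_from_list>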

Open vSwitch

This section describes the alerts for the Open vSwitch (OVS) processes.

Warning

  • Monitoring of the OVS processes is available starting from the MCP 2019.2.3 update.
  • The OVSInstanceArpingCheckDown alert is available starting from the MCP 2019.2.4 update.
  • The OVSTooManyPortRunningOnAgent, OVSErrorOnPort, OVSNonInternalPortDown and OVSGatherFailed alerts are available starting from the MCP 2019.2.6 update.

ProcessOVSVswitchdMemoryWarning

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The ovs-vswitchd process consumes more than 20% of system memory.
Raise condition procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.2
Description Raises when the virtual memory of the ovs-vswitchd process exceeds 20% of the host memory.
Tuning Not required
ProcessOVSVswitchdMemoryCritical

Available starting from the 2019.2.3 maintenance update

Severity Critical
Summary The ovs-vswitchd process consumes more than 30% of system memory.
Raise condition procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.3
Description Raises when the virtual memory of the ovs-vswitchd process exceeds 30% of the host memory.
Tuning Not required
OVSInstanceArpingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity Major
Summary The OVS instance arping check is down.
Raise condition instance_arping_check_up == 0
Description Raises when the OVS instance arping check on the {{ $labels.host }} node is down for 2 minutes. The host label in the raised alert contains the affected node name.
Tuning Not required
OVSTooManyPortRunningOnAgent

Available starting from the 2019.2.6 maintenance update

Severity Major
Summary The number of OVS ports is {{ $value }} (ovs-vsctl list port) on the {{ $labels.host }} host, which is more than the expected limit.
Raise condition sum by (host) (ovs_bridge_status)  > 1500
Description

Raises when too many networks are created or OVS does not properly clean up the OVS ports. OVS may malfunction if too many ports are assigned to a single agent.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Run ovs-vsctl show from the affected node and openstack port list from the OpenStack controller nodes and inspect the existing ports.
  • Remove the unneeded ports or redistribute the OVS ports.
Tuning

For example, to change the threshold to 1600:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            OVSTooManyPortRunningOnAgent:
              if: >-
                sum by (host) (ovs_bridge_status)  > 1600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

OVSErrorOnPort

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is reporting errors.
Raise condition ovs_bridge_status == 2
Description Raises when an OVS port reports errors, indicating that the port is not working properly.
Troubleshooting
  1. From the affected node, run ovs-vsctl show.
  2. Inspect the output for error entries.
Tuning Not required
OVSNonInternalPortDown

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is down.
Raise condition ovs_bridge_status{type!="internal"} == 0
Description Raises when the port on the OVS bridge is in the DOWN state, which may lead to an unexpected network disturbance.
Troubleshooting
  1. From the affected node, run ip a to verify if the port is in the DOWN state.
  2. If required, bring the port up using ifconfig <interface> up.
Tuning Not required
OVSGatherFailed

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary Failure to gather the OVS information on the {{ $labels.host }} host.
Raise condition ovs_bridge_check == 0
Description Raises when the check script for the OVS bridge fails to gather data. OVS is not monitored.
Troubleshooting Run /usr/local/bin/ovs_parse_bridge.py from the affected host and inspect the output.
Tuning Not required
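
The port count tracked by OVSTooManyPortRunningOnAgent and the error states reported by OVSErrorOnPort can be reproduced on the affected node with the ovs-vsctl commands referenced above. For example:

  # Overall bridge and port layout, including error entries
  ovs-vsctl show

  # Count the ports per bridge; the total should stay below the alert threshold
  for br in $(ovs-vsctl list-br); do
    echo "${br}: $(ovs-vsctl list-ports ${br} | wc -l) ports"
  done
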
RabbitMQ

This section describes the alerts for the RabbitMQ service.


RabbitmqServiceDown
Severity Critical
Summary The RabbitMQ service on the {{$labels.host}} node is down.
Raise condition rabbitmq_up == 0
Description Raises when the RabbitMQ service is down on one node, which affects the RabbitMQ availability. The alert raises 1 minute after the issue occurrence. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the RabbitMQ status using systemctl status rabbitmq-server.
  • Inspect the RabbitMQ logs in /var/log/rabbitmq.
  • Verify that the node has enough resources, such as disk space or RAM.
Tuning Not required
RabbitmqServiceOutage
Severity Critical
Summary All RabbitMQ services are down.
Raise condition count(rabbitmq_up == 0) == count(rabbitmq_up)
Description Raises when RabbitMQ is down on all nodes, indicating that the service is unavailable. The alert raises 1 minute after the issue occurrence.
Troubleshooting
  • Verify the RabbitMQ status on the msg nodes using systemctl status rabbitmq-server.
  • Inspect the RabbitMQ logs in the /var/log/rabbitmq directory on the msg nodes.
  • Verify that the node has enough resources such as disk space or RAM.
Tuning Not required
RabbitMQUnequalQueueCritical

Available starting from the 2019.2.5 maintenance update

Severity Critical
Summary The RabbitMQ service has unequal number of queues across the cluster instances.
Raise condition max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)
Description Raises when the RabbitMQ cluster nodes have an inconsistent number of queues for 10 minutes. This issue can occur after a service restart and can make RabbitMQ inaccessible.
Troubleshooting Contact Mirantis support.
Tuning Not required
RabbitmqDiskFullWarning
Severity Warning
Summary The RabbitMQ service on the {{$labels.host}} node has less than 500 MB of free disk space.
Raise condition rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10
Description Raises when the free disk space available to RabbitMQ drops to 500 MB or less (by default, 10 multiplied by the 50 MB free disk space limit). RabbitMQ checks the available disk space more frequently as it shrinks, which can increase the system load.
Troubleshooting Free or add more disk space on the affected node.
Tuning

To change the threshold multiplier to 15:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqDiskFullWarning:
              if: >-
                rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 15
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqDiskFullCritical
Severity Critical
Summary The RabbitMQ disk space on the {{$labels.host}} node is full.
Raise condition rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit
Description Raises when the free disk space available to RabbitMQ falls below the limit (50 MB by default). The alert is cluster-wide. RabbitMQ blocks the producers and prevents in-memory messages from being paged to the disk. Frequent disk checks contribute to load growth. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Add more disk space on the affected node.
Tuning Not required
RabbitmqMemoryLowWarning
Severity Warning
Summary The RabbitMQ service uses more than 80% of memory on the {{$labels.host}} node for 2 minutes.
Raise condition 100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8
Description Raises when the RabbitMQ memory consumption reaches the warning threshold (80% of allocated memory by default). The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Edit the high memory watermark in the service configuration.

  • Increase paging for RabbitMQ through CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    The service restart will reset this change.

  • Add more memory on the affected node.

Tuning

To change the watermark:

  1. On the cluster level of the Reclass model, specify the vm_high_watermark parameter in openstack/message_queue.yml. For example:

    rabbitmq:
      server:
        memory:
          vm_high_watermark: 0.8
    
  2. From the Salt Master node, apply the following states one by one:

    salt '*' saltutil.refresh_pillar
    salt -C 'I@rabbitmq:server' state.sls rabbitmq.server
    
RabbitmqMemoryLowCritical
Severity Critical
Summary The RabbitMQ service on the {{$labels.host}} node is out of memory.
Raise condition rabbitmq_node_mem_used >= rabbitmq_node_mem_limit
Description Raises when RabbitMQ uses all the allocated memory and blocks all connections that publish messages to prevent further memory growth. The host label in the raised alert contains the host name of the affected node. If other system services consume more RAM, the system may start to swap, which can cause the Erlang VM to crash and RabbitMQ to go down on the node.
Troubleshooting
  • Increase paging for RabbitMQ through CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    The service restart will reset this change.

  • Add more memory on the affected node.

Tuning Not required
RabbitmqMessagesTooHigh
Severity Warning
Summary The RabbitMQ service on the {{$labels.host}} node has received more than 2^20 messages.
Raise condition rabbitmq_overview_messages > 2^20
Description Raises when the number of messages received by RabbitMQ exceeds the warning limit (by default, 1024 multiplied by 1024), typically indicating that some consumers do not pick up messages from the queues.
Troubleshooting Verify if huge queues are present using rabbitmqctl list_queues.
Tuning

For example, to change the message warning threshold to 2^21:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqMessagesTooHigh:
              if: >-
                rabbitmq_overview_messages > 2^21
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsTooHigh
Severity Critical (Major in 2019.2.4)
Summary The rate of errors in RabbitMQ logs is more than 0.2 error messages per second on the {{$labels.host}} node as measured over 5 minutes.
Raise condition
  • In 2019.2.4: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
  • In 2019.2.5: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[5m])) without (level) > 0.05
Description Raises when the average per-second rate of the error, fatal, or emergency messages in the RabbitMQ logs on the node is more than 0.2 per second. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the log files in the /var/log/rabbitmq directory on the affected node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="rabbitmq", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsMajor

Available starting from the 2019.2.5 maintenance update

Severity Major
Summary The RabbitMQ logs on the {{ $labels.host }} node contain errors (as measured over the last 30 minutes).
Raise condition sum(increase(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[30m])) without (level) > 0
Description Raises when the error or critical messages appear in the RabbitMQ logs on a node. The host label in the raised alert contains the name of the affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity.
Troubleshooting Inspect the log files in the /var/log/rabbitmq directory on the affected node.
Tuning Not required
RabbitmqFdUsageWarning

Available starting from the 2019.2.6 maintenance update

Severity Warning
Summary The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.
Raise condition rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70
Description

Raises when the RabbitMQ instance uses 70% of available file descriptors.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Inspect openstack/control.yml in the cluster model to verify if the default value of the OpenStack service rpc_workers was overwritten.
  • Decrease the rpc_workers value and apply the state for the corresponding service.
Tuning

For example, to change the threshold to 60%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageWarning:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqFdUsageCritical

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.
Raise condition rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95
Description

Raises when the RabbitMQ instance uses 95% of available file descriptors, indicating that RabbitMQ is about to crash.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Inspect openstack/control.yml in the cluster model to verify if the default value of the OpenStack service rpc_workers was overwritten.
  • Decrease the rpc_workers value and apply the state for the corresponding service.
Tuning

For example, to change the threshold to 87%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageCritical:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 87
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Reclass

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the Reclass modifications alerts.


ReclassUnstagedChanges
Severity Warning
Summary {{ $labels.value }} files under the {{ $labels.directory }} directory have been modified without being committed.
Raise condition reclass_files_unstaged_changes > 0
Description Raises when the Reclass model has been modified but the local changes have not been added to a potential Git commit.
Troubleshooting
  • If the alert is unexpected, reset the model to the remote HEAD.
  • If the alert is expected, add the changes using git add.
Tuning Not required
ReclassStagedChanges
Severity Warning
Summary {{ $labels.value }} files under the {{ $labels.directory }} directory have diverged from the expected local state.
Raise condition reclass_files_staged_changes > 0
Description Raises when the local changes in the Reclass model have been added to a Git commit but not committed yet.
Troubleshooting
  • If the alert is unexpected, reset the model to the remote HEAD.
  • If the alert is expected, commit the changes using git commit.
Tuning Not required
ReclassRemoteDesync
Severity Warning
Summary {{ $labels.value }} local files under the {{ $labels.directory }} directory have diverged from the expected remote state.
Raise condition reclass_remote_desync > 0
Description Raises when the local and remote configurations in the Reclass model have diverged, meaning that the local history has changed.
Troubleshooting
  • If the alert is unexpected, reset the model to the remote HEAD.
  • If the alert is expected, push the changes using git push.
Tuning Not required
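
The troubleshooting steps for the three Reclass alerts above come down to standard Git operations on the cluster model. A minimal sketch, assuming the model is checked out in /srv/salt/reclass and tracks origin/master (adjust the path and branch to your deployment):

    cd /srv/salt/reclass
    git status --short        # lists the unstaged and staged files behind the alerts
    git diff                  # review the local modifications

    # If the changes are expected, record and publish them:
    git add -A
    git commit -m "Describe the model change"
    git push

    # If the changes are unexpected, reset the model to the remote HEAD:
    git fetch origin
    git reset --hard origin/master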
Salt

This section describes the alerts for the Salt Master and Salt Minion services.


SaltMasterServiceDown
Severity Critical
Summary The salt-master service on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="salt-master"} == 0
Description Raises when Telegraf cannot find running salt-master processes on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the salt-master service status using systemctl status salt-master.
  • Inspect the salt-master service logs in /var/log/salt/master.
Tuning Not required
SaltMinionServiceDown
Severity Critical
Summary The salt-minion service on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="salt-minion"} == 0
Description Raises when Telegraf cannot find running salt-minion processes on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the salt-minion service status using systemctl status salt-minion.
  • Inspect the salt-minion service logs in /var/log/salt/minion.
Tuning Not required
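
For both Salt alerts, the same basic checks apply on the affected node. A minimal sketch for the Salt Master; for the Salt Minion case, substitute salt-minion and /var/log/salt/minion:

    systemctl status salt-master                     # current service state
    journalctl -u salt-master --since "1 hour ago"   # recent unit logs
    tail -n 100 /var/log/salt/master                 # Salt's own log file
    systemctl restart salt-master                    # restart the service if it is down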
SMART disks

This section describes the alerts for SMART disks.

Warning

SMART disks monitoring is available starting from the MCP 2019.2.3 update. For the existing MCP deployments, manually enable SMART disks monitoring as described in Enable SMART disk monitoring.

Warning

SMART disks alerts have been removed starting from the MCP 2019.2.9 update.


SystemSMARTDiskUDMACrcErrorsTooHigh

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting UDMA CRC errors for 5 minutes.
Raise condition increase(smart_device_udma_crc_errors[1m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the SMART UDMA CRC error messages on a host every minute for 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskUDMACrcErrorsTooHigh:
              if: >-
                increase(smart_device_udma_crc_errors[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskHealthStatus

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting bad health status for 1 minute.
Raise condition smart_device_health_ok == 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects bad health status of a SMART disk on a host for 1 minute. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning Not required
SystemSMARTDiskReadErrorRate

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.
Raise condition increase(smart_device_read_error_rate[1m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReadErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReadErrorRate:
              if: >-
                increase(smart_device_read_error_rate[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskSeekErrorRate

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.
Raise condition increase(smart_device_seek_error_rate[1m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the SeekErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

To change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskSeekErrorRate:
              if: >-
               increase(smart_device_seek_error_rate[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskTemperatureHigh

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.
Raise condition smart_device_temp_c >= 60
Description Raises when the SMART Telegraf input plugin detects that the SMART disk temperature on a host is above the default threshold of 60°C for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 40°C:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskTemperatureHigh:
              if: >-
                smart_device_temp_c >= 40
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskReallocatedSectorsCount

Available starting from the 2019.2.3 maintenance update

Severity Major (Minor starting from the 2019.2.4 maintenance update)
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.
Raise condition increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[10m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReallocatedSectorsCount messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReallocatedSectorsCount:
              if: >-
                increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskCurrentPendingSectors

Available starting from the 2019.2.3 maintenance update

Severity Major (Minor starting from the 2019.2.4 maintenance update)
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.
Raise condition increase(smart_attribute_raw_value{name="Current_Pending_Sector"} [10m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the CurrentPendingSectors messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskCurrentPendingSectors:
              if: >-
                increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskReportedUncorrectableErrors

Available starting from the 2019.2.3 maintenance update

Severity Major (Minor starting from the 2019.2.4 maintenance update)
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.
Raise condition increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[10m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReportedUncorrectableErrors messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReportedUncorrectableErrors:
              if: >-
                increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskOfflineUncorrectableSectors

Available starting from the 2019.2.5 maintenance update

Severity Major
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.
Raise condition smart_attribute_raw_value{name="Offline_Uncorrectable"} > 0
Description Raises when the SMART input plugin detects that the Offline_Uncorrectable parameter is not 0. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning Not required
SystemSMARTDiskEndToEndError

Available starting from the 2019.2.3 maintenance update

Severity Major (Minor starting from the 2019.2.4 maintenance update)
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.
Raise condition increase(smart_attribute_raw_value{name="End-to-End_Error"}[10m]) > 0
Description Raises when the SMART Telegraf input plugin (using smartmontools) detects the EndToEndError messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the disk SMART data using the smartctl command.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskEndToEndError:
              if: >-
                increase(smart_attribute_raw_value{name="End-to-End_Error"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
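
All SMART alerts in this section point to the smartctl command for verification. A minimal check sequence, assuming the affected device is /dev/sda (substitute the device reported in the alert's device label):

    smartctl -H /dev/sda    # overall health self-assessment (PASSED or FAILED)
    smartctl -A /dev/sda    # vendor attributes such as Reallocated_Sector_Ct or UDMA_CRC_Error_Count
    smartctl -a /dev/sda    # full SMART report, including the error log and self-test history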

SSL certificates

This section describes the alerts for SSL certificates.

Warning

Monitoring of the functionality is available starting from the MCP 2019.2.3 update.


CertificateExpirationWarning

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.source }} certificate on the {{ $labels.host }} node expires in less than 60 days.
Raise condition x509_cert_expiry / (24 * 60 * 60) < 60
Description Raises when a certificate on a node expires in less than 60 days. The host and source labels in the raised alert contain the host name of the affected node and the path to the certificate.
Troubleshooting Update the certificates.
Tuning Not required
CertificateExpirationCritical

Available starting from the 2019.2.3 maintenance update

Severity Critical
Summary The {{ $labels.source }} certificate on the {{ $labels.host }} node expires in less than 30 days.
Raise condition x509_cert_expiry / (24 * 60 * 60) < 30
Description Raises when a certificate on a node expires in less than 30 days. The host and source labels in the raised alert contain the host name of the affected node and the path to the certificate.
Troubleshooting Update the certificates.
Tuning Not required
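
To verify the expiration date of the certificate reported in the source label, you can inspect it with openssl on the affected node. A minimal sketch; the path below is a placeholder for the value of the source label:

    # Print the expiration date of the certificate
    openssl x509 -noout -enddate -in /etc/ssl/certs/<certificate>.pem

    # Exit with a non-zero code if the certificate expires within the next 30 days
    openssl x509 -noout -checkend $((30*24*3600)) -in /etc/ssl/certs/<certificate>.pem \
      || echo "certificate expires within 30 days"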
System

This section describes the system alerts.


SystemCpuFullWarning
Severity Warning
Summary The average CPU usage on the {{ $labels.host }} node is more than 90% for 2 minutes.
Raise condition 100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 90
Description Raises when the average CPU idle time on a node is less than 10% for the last 5 minutes, indicating that the node is under load. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the output of the top command on the affected node.
Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuFullWarning:
              if: >-
                100 - avg_over_time(cpu_usage_idle{cpu=""cpu-total""}[5m]) > 80
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighWarning
Severity Warning
Summary The system load per CPU on the {{ $labels.host }} node is more than 1 for 5 minutes.
Raise condition
  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 1
  • In 2019.2.8: system_load15 / system_n_cpus > 1
  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}
Description Raises when the average load on the node is higher than 1 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the output of the uptime and top commands on the affected node.
Tuning

For example, to change the threshold to 1.5 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighWarning:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 1.5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighCritical
Severity Critical (Warning prior to the 2019.2.8 maintenance update)
Summary The system load per CPU on the {{ $labels.host }} node is more than 2 for 5 minutes.
Raise condition
  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 2
  • In 2019.2.8: system_load15 / system_n_cpus > 2
  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}
Description Raises when the average load on the node is higher than 2 per CPU over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the output of the uptime and top commands on the affected node.
Tuning

For example, to change the threshold to 3 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighCritical:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 3
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
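
For the CPU and load alerts above, a quick way to relate the reported load to the number of cores on the affected node (a minimal sketch):

    uptime                    # 1-, 5-, and 15-minute load averages
    nproc                     # number of CPU cores to compare the averages against
    top -b -n 1 | head -n 15  # snapshot of the busiest processes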

SystemDiskFullWarning
Severity Warning
Summary The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is more than 85% full for 2 minutes.
Raise condition disk_used_percent >= 85
Description Raises when the disk partition on a node is 85% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.
Troubleshooting
  • Verify the used and free disk space on the node using the df command.
  • Increase the disk space on the affected node or remove unused data.
Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullWarning:
              if: >-
                disk_used_percent >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskFullMajor
Severity Major
Summary The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is more than 95% full for 2 minutes.
Raise condition disk_used_percent >= 95
Description Raises when the disk partition on a node is 95% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.
Troubleshooting
  • Verify the used and free disk space on the node using the df command.
  • Increase the disk space on the affected node or remove unused data.
Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullMajor:
              if: >-
                disk_used_percent >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullWarning
Severity Warning
Summary The {{ $labels.host }} node uses more than 85% of disk inodes in the {{ $labels.path }} volume for 2 minutes.
Raise condition 100 * disk_inodes_used / disk_inodes_total >= 85.0
Description Raises when the usage of inodes of a disk partition on the node is 85%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.
Troubleshooting
  • Verify the used and free inodes on the affected node using the df -i command.
  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.
Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullWarning:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullMajor
Severity Major
Summary The {{ $labels.host }} node uses more than 95% of disk inodes in the {{ $labels.path }} volume for 2 minutes.
Raise condition 100 * disk_inodes_used / disk_inodes_total >= 95.0
Description Raises when the usage of inodes of a disk partition on the node is 95%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.
Troubleshooting
  • Verify the used and free inodes on the affected node using the df -i command.
  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.
Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullMajor:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
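
For the two inode alerts above, the following sketch shows how to confirm the inode usage and locate a directory tree that accumulates many small files; the / mount point is an assumption, replace it with the path from the alert:

    df -i     # per-filesystem inode usage

    # Count files per top-level directory on the affected filesystem
    find / -xdev -type f | awk -F/ '{print "/" $2}' | sort | uniq -c | sort -n | tail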

SystemDiskErrorsTooHigh
Severity Warning
Summary The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting errors for 5 minutes.
Raise condition increase(hdd_errors_total[1m]) > 0
Description Raises when disk error messages were detected every minute for the last 5 minutes in the syslog (/var/log/syslog) on a host. Fluentd parses the syslog for the error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.* and \b[sv]d[a-z]{1,2}\d{0,3}\b.*error regular expressions and increases the count in case of success. The host and device labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting Inspect the syslog journal for error words using grep.
Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskErrorsTooHigh:
              if: >-
                increase(hdd_errors_total[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
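
To reproduce the check that Fluentd performs for this alert, you can grep the syslog on the affected node with an equivalent pattern (an approximation of the regular expressions above, rewritten with POSIX character classes for grep -E):

    grep -E 'error.*\b[sv]d[a-z]{1,2}[0-9]{0,3}\b|\b[sv]d[a-z]{1,2}[0-9]{0,3}\b.*error' \
      /var/log/syslog | tail -n 20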

SystemDiskBacklogWarning

Available starting from the 2019.2.14 maintenance update

Severity Warning
Summary Disk {{ $labels.name }} backlog warning.
Raise condition rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 10
Description I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 10 during the last 10 minutes.
SystemDiskBacklogCritical

Available starting from the 2019.2.14 maintenance update

Severity Critical
Summary Disk {{ $labels.name }} backlog critical.
Raise condition rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 20
Description I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 20 during the last 10 minutes.
SystemDiskRequestQueuedWarning

Available starting from the 2019.2.14 maintenance update

Severity Warning
Summary Disk {{ $labels.name }} requests were queued for 90% of time.
Raise condition rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.9
Description I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 90% of the device time in queue during the last 10 minutes.
SystemDiskRequestQueuedCritical

Available starting from the 2019.2.14 maintenance update

Severity Critical
Summary Disk {{ $labels.name }} requests were queued for 98% of time.
Raise condition rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.98
Description I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 98% of the device time in queue during the last 10 minutes.
SystemMemoryFullWarning
Severity Warning
Summary The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.
Raise condition mem_used_percent > 90 and mem_available < 8 * 2^30
Description

Raises when a node uses more than 90% of RAM and less than 8 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Verify the free and used RAM on the affected node using free -h.
  • Identify the service that consumes RAM.
Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullWarning:
              if: >-
                mem_used_percent > 80 and mem_available < 8 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemMemoryFullMajor
Severity Major
Summary The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.
Raise condition mem_used_percent > 95 and mem_available < 4 * 2^30
Description

Raises when a node uses more than 95% of RAM and less than 4 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Verify the free and used RAM on the affected node using free -h.
  • Identify the service that consumes RAM.
Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullMajor:
              if: >-
                mem_used_percent > 90 and mem_available < 4 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
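
For both memory alerts, the following sketch shows how to confirm the usage and identify the largest consumers on the affected node:

    free -h                            # total, used, and available RAM
    ps aux --sort=-%mem | head -n 10   # processes sorted by memory consumption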

SystemSwapFullWarning

Removed since the 2019.2.4 maintenance update

Severity Warning
Summary The swap on the {{ $labels.host }} node is more than 50% used for 2 minutes.
Raise condition swap_used_percent >= 50.0
Description

Raises when the swap on a node is 50% used, indicating that the node is under a high load or out of RAM. The host label in the raised alert contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting
  • Verify the free and used RAM and swap on the affected node using free -h.
  • Identify the service that consumes RAM and swap.
Tuning Disable the alert as described in Manage alerts.
SystemSwapFullMinor

Removed since the 2019.2.4 maintenance update

Severity Minor
Summary The swap on the {{ $labels.host }} node is more than 90% used for 2 minutes.
Raise condition swap_used_percent >= 90.0
Description

Raises when the swap on a node is 90% used, indicating that the node is under a high load or out of RAM. The host label contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting
  • Verify the free and used RAM and swap on the affected node using free -h.
  • Identify the service that consumes RAM and swap.
Tuning Disable the alert as described in Manage alerts.
SystemRxPacketsDroppedTooHigh
Severity Warning
Summary More than 60 packets received by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.
Raise condition increase(net_drop_in[1m]) > 60 unless on (host,interface) bond_slave_active == 0
Description

Raises when the number of dropped RX packets on an interface (except the bond slave interfaces that are in the BACKUP state) is higher than 60 for the last minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node. The reasons can be as follows:

  • Full softnet backlog
  • Wrong VLAN tags, packets received with unknown or unregistered protocols
  • IPv6 frames if the server is configured only for IPv4

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the output of the ip -s a command.
Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_in[1m]) > 600 unless on (host,interface) bond_slave_active == 0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsDroppedTooHigh
Severity Warning
Summary More than 100 packets transmitted by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.
Raise condition increase(net_drop_out[1m]) > 100
Description

Raises when the number of dropped TX packets on the interface is higher than 100 for the last 1 minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the output of the ip -s a command.
Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
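
The packet drop alerts above, as well as the packet error alerts that follow, point to the ip command for troubleshooting. A minimal sketch, assuming the affected interface is bond0 (take the actual name from the interface label):

    ip -s link show dev bond0   # RX/TX counters for one interface, including dropped packets and errors
    watch -n 5 'ip -s a'        # watch the counters of all interfaces grow over time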

SystemRxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity Warning
Summary More than {{ net_rx_error_threshold }} received packets had errors.
Raise condition increase(net_err_in[1m]) > {{ net_rx_error_threshold }}
Description

Raises when the number of packets with errors received by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the output of the ip -s a command.
Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsErrorTooHigh:
              if: >-
                increase(net_err_in[1m]) > 600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity Warning
Summary More than {{ net_tx_error_threshold }} transmitted packets had errors.
Raise condition increase(net_err_out[1m]) > {{ net_tx_error_threshold }}
Description

Raises when the number of packets with errors transmitted by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the output of the ip -s a command.
Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsErrorTooHigh:
              if: >-
                increase(net_err_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CronProcessDown
Severity Critical
Summary The cron process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="cron"} == 0
Description Raises when Telegraf cannot find running cron processes on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the cron service status using systemctl status cron.
  • Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
SshdProcessDown
Severity Critical
Summary The SSH process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="sshd"} == 0
Description Raises when Telegraf cannot find running sshd processes on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the sshd service status using systemctl status sshd.
  • Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
SshFailedLoginsTooHigh
Severity Warning
Summary More than 5 failed SSH login attempts on the {{ $labels.host }} node during the last 5 minutes.
Raise condition increase(failed_logins_total[5m]) > 5
Description Raises when more than 5 failed login messages were detected in the syslog (/var/log/syslog) for the last 5 minutes. Fluentd parses the syslog for the ^Invalid user regular expression and increases the count in case of success. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the syslog journal for Invalid user words using grep.
Tuning

For example, to change the threshold to 50 failed logins per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SshFailedLoginsTooHigh:
              if: >-
                increase(failed_logins_total[1h]) > 50
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
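
To reproduce the check behind this alert, grep the logs on the affected node for the same pattern that Fluentd uses; on Ubuntu, failed SSH attempts are also recorded in /var/log/auth.log:

    grep 'Invalid user' /var/log/syslog | tail -n 20
    grep 'Failed password' /var/log/auth.log | tail -n 20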

PacketsDroppedByCpuWarning

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.
Raise condition floor(increase(nstat_packet_drop[10m])) > 0
Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 0 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 100 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuWarning:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 100
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
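
To inspect the counters behind the PacketsDroppedByCpu alerts and the current backlog size on the affected node, the following sketch can be used; the values in /proc/net/softnet_stat are hexadecimal and one line is printed per CPU:

    cat /proc/net/softnet_stat            # column 2: packets dropped per CPU (hex)
    sysctl net.core.netdev_max_backlog    # current backlog size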

PacketsDroppedByCpuMinor
Severity Minor
Summary The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.
Raise condition floor(increase(nstat_packet_drop[10m])) > 100
Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 100 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 500 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuMinor:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NetdevBudgetRanOutsWarning
Severity Warning
Summary The rate of net_rx_action loops terminations on the {{ $labels.host }} node is {{ $value }} per second during the last 5 minutes.
Raise condition max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.1
Description

Raises when the average rate of the net_rx_action loop terminations is greater than 0.1 per second for the last 5 minutes, according to the data in column 3 of the /proc/net/softnet_stat file. The alert typically indicates budget consumption or reaching of the time limit. The net_rx_action loop starts processing the packets from the memory to which the device transferred the packets through Direct Memory Access (DMA). Running out of budget or time can cause loop termination. Terminations of net_rx_action indicate that the CPU does not have enough quotas (budget or time) to proceed with all associated packets or some drivers encounter system issues. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Verify the number of dropped packets by CPU.

  • If no dropped packets exist, increase the budget or time interval to resolve the issue:

    • Increase the budget by modifying the net.core.netdev_budget kernel parameter. For example:

      sysctl -w net.core.netdev_budget=600
      

      For details, see kernel documentation.

    • Increase the time interval by modifying the net.core.netdev_budget_usecs kernel parameter. For example:

      sudo sysctl -w net.core.netdev_budget_usecs=5000
      

      For details, see kernel documentation.

Tuning

For example, to change the threshold to 0.2:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NetdevBudgetRanOutsWarning:
              if: >-
                max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.2
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
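
Before increasing the budget, you can read the current values of the kernel parameters mentioned above and the per-CPU time_squeeze counters (column 3 of /proc/net/softnet_stat); note that net.core.netdev_budget_usecs is only present on newer kernels:

    sysctl net.core.netdev_budget net.core.netdev_budget_usecs   # current settings
    awk '{printf "cpu%d %s\n", NR-1, $3}' /proc/net/softnet_stat # time_squeeze per CPU (hex)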

SystemCpuIoWaitWarning

Available starting from the 2019.2.14 maintenance update

Severity Warning
Summary CPU waited for I/O 40% of time.
Raise condition cpu_usage_iowait > 40
Description The CPU on the {{ $labels.host }} node spent 40% of time waiting for I/O.
SystemCpuIoWaitCritical

Available starting from the 2019.2.14 maintenance update

Severity Critical
Summary CPU waited for I/O 50% of time.
Raise condition cpu_usage_iowait > 50
Description The CPU on the {{ $labels.host }} node spent 50% of time waiting for I/O.
SystemCpuStealTimeWarning

Available starting from the 2019.2.6 maintenance update

Severity Warning
Summary The CPU steal time was above 5.0% on the {{ $labels.host }} node for 5 minutes.
Raise condition cpu_usage_steal > 5.0
Description

Raises when a VM vCPU spends more than 5% of its time waiting for a real CPU for the last 5 minutes, which typically occurs under high load in case of CPU shortage. Waiting for resources slows down the processes in the VM.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeWarning:
              if: >-
                cpu_usage_steal > 2.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemCpuStealTimeCritical

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The CPU steal time was above 10.0% on the {{ $labels.host }} node for 5 minutes.
Raise condition cpu_usage_steal > 10.0
Description

Raises when a VM vCPU spends more than 10% of its time waiting for a physical CPU during the last 5 minutes, which typically occurs under high load when physical CPU capacity is scarce. Waiting for resources slows down the processes in the VM.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 5%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeCritical:
              if: >-
                cpu_usage_steal > 5.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

OpenStack

This section describes the alerts available for the OpenStack services.

OpenStack service endpoints

Note

This feature is available starting from the MCP 2019.2.11 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the alert for the OpenStack service endpoints outage.


OpenstackServiceEndpointDown
Severity Critical
Summary OpenStack API endpoint is down.
Raise condition openstack_api_check_status == 0
Description Raises when an API endpoint from the OpenStack service catalog did not pass the HTTP-probe check. The service_name and interface labels in the raised alert contain the affected service name and interface.
Troubleshooting Verify the availability of the affected endpoint from the OpenStack service catalog. For a list of all available endpoints, run openstack endpoint list.
Tuning Not required
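
A hedged verification sequence for this alert, assuming admin credentials are sourced for the OpenStack client; the service name and URL are placeholders taken from the alert labels:

  # List the endpoints of the affected service and note the URL of the reported interface
  openstack endpoint list --service <service_name>

  # Probe the endpoint URL manually and print only the HTTP status code
  curl -ks -o /dev/null -w '%{http_code}\n' <endpoint_url>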
Cinder

This section describes the alerts for Cinder.


CinderApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Cinder API is not accessible for all available Cinder endpoints in the OpenStack service catalog.
Raise condition max(openstack_api_check_status{name=~"cinder.*"}) == 0
Description Raises when the checks against all available internal Cinder endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Cinder endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
CinderApiDown

Removed since the 2019.2.11 maintenance update

Severity Major
Summary Cinder API is not accessible for the {{ $labels.name }} endpoint.
Raise condition openstack_api_check_status{name=~"cinder.*"} == 0
Description Raises when the check against one of the available internal Cinder endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Cinder endpoints (URLs) from the output of the openstack endpoint list command.
Tuning Not required
CinderApiEndpointDown
Severity Minor
Summary The cinder-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"cinder-api"} == 0
Description Raises when the check against a Cinder API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the host name of the affected node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl, as shown in the sketch after this entry.
Tuning Not required
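
For example, the following hedged sequence locates the probed URL on the affected node and repeats the check manually; the grep pattern is an assumption about how the input is named in the Telegraf configuration:

  # Find the cinder-api URL that Telegraf probes
  grep -B2 -A4 'cinder-api' /etc/telegraf/telegraf.d/input-http_response.conf

  # Repeat the probe manually and print only the HTTP status code
  curl -ks -o /dev/null -w '%{http_code}\n' <probed_url>

  # Inspect recent Telegraf log entries for failed checks
  journalctl -u telegraf -n 50 --no-pager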
CinderApiEndpointDownMajor
Severity Major
Summary More than 50% of cinder-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"cinder-api"} == 0) >= count(http_response_status{name=~"cinder-api"}) * 0.5
Description Raises when the check against a Cinder API endpoint does not pass on more than 50% of OpenStack controller nodes. For details on the affected nodes, see the host label in the CinderApiEndpointDown alerts. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
CinderApiEndpointsOutage
Severity Critical
Summary All available cinder-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"cinder-api"} == 0) == count(http_response_status{name=~"cinder-api"})
Description Raises when the check against a Cinder API endpoint does not pass on all OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
CinderServiceDown
Severity Minor
Summary The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.
Raise condition openstack_cinder_service_state == 0
Description Raises when a Cinder service on the OpenStack controller or compute node is in the DOWN state. For the list of Cinder services, see Cinder Block Storage service overview. The binary and hostname labels contain the name of the service that is in the DOWN state and the node that hosts the service.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>, as shown in the sketch after this entry.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
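
A hedged command sequence for this troubleshooting flow; the binary name and log file are placeholders taken from the alert labels:

  # From a node with OpenStack client access: list Cinder services and their states
  openstack volume service list

  # On the affected node: check the service unit and its recent log entries
  systemctl status <binary>
  tail -n 100 /var/log/cinder/<binary>.log

  # On a mon node: verify the Telegraf remote agent that collects the metric
  docker service ls --filter name=monitoring_remote_agent
  docker service logs --tail 100 monitoring_remote_agent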
CinderServicesDownMinor
Severity Minor
Summary More than 30% of {{ $labels.binary }} services are down.
Raise condition count by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.3
Description Raises when a Cinder service is in the DOWN state on more than 30% of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
CinderServicesDownMajor
Severity Major
Summary More than 60% of {{ $labels.binary }} services are down.
Raise condition count  by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.6
Description Raises when a Cinder service is in the DOWN state on more than 60% of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
CinderServiceOutage
Severity Critical
Summary All {{ $labels.binary }} services are down.
Raise condition count by(binary) (openstack_cinder_service_state == 0) == on(binary) count by(binary) (openstack_cinder_service_state)
Description Raises when a Cinder service is in the DOWN state on all ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
CinderVolumeProcessDown

Available starting from the 2019.2.8 maintenance update

Severity Minor
Summary A cinder-volume process is down.
Raise condition procstat_running{process_name="cinder-volume"} == 0
Description Raises when a cinder-volume process on a node is down. The host label in the raised alert contains the affected node.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
CinderVolumeProcessesDownMinor

Available starting from the 2019.2.8 maintenance update

Severity Minor
Summary 30% of cinder-volume processes are down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) >= count (procstat_running{process_name="cinder-volume"}) * {{ minor_threshold }}
Description Raises when at least one cinder-volume process is down. The raised alert includes the number of cinder-volume processes in the DOWN state (>= 30%).
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
CinderVolumeProcessesDownMajor

Available starting from the 2019.2.8 maintenance update

Severity Major
Summary 60% of cinder-volume processes are down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) >= count (procstat_running{process_name="cinder-volume"}) * {{ major_threshold }}
Description Raises when at least two cinder-volume processes are down. The raised alert includes the number of cinder-volume processes in the DOWN state (>= 60%).
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
CinderVolumeServiceOutage

Available starting from the 2019.2.8 maintenance update

Severity Critical
Summary The cinder-volume service is down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) == count (procstat_running{process_name="cinder-volume"})
Description Raises when all cinder-volume processes are down.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
CinderErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Cinder logs on the {{ $labels.host }} node is larger than 0.2 messages per second.
Raise condition sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))", service="cinder"}[5m])) > 0.2
Description Raises when the average per-second rate of error, fatal, or emergency messages in Cinder logs on the node is more than 0.2 per second. The host label in the raised alert contains the affected node. Fluentd forwards all logs from Cinder to Elasticsearch and counts the number of log messages per severity.
Troubleshooting
  • Inspect the log files in the /var/log/cinder/ directory on the corresponding node.
  • Inspect Cinder logs in the Kibana web UI.
Tuning

Typically, you should not change the default value. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.
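
For example, you can evaluate the raise-condition expression over a longer range through the Prometheus HTTP API instead of the web UI; a minimal sketch, assuming the Prometheus server address and port are reachable from where you run it (both are placeholders):

  curl -G 'http://<prometheus_host>:<prometheus_port>/api/v1/query_range' \
    --data-urlencode 'query=sum(rate(log_messages{service="cinder", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level)' \
    --data-urlencode "start=$(date -d '-24 hours' +%s)" \
    --data-urlencode "end=$(date +%s)" \
    --data-urlencode 'step=300'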

To change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CinderErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="cinder", \
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Glance

This section describes the alerts for Glance.


GlanceApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Glance API is not accessible for the Glance endpoint in the OpenStack service catalog.
Raise condition openstack_api_check_status{name="glance"} == 0
Description Raises when the checks against all available internal Glance endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Glance are 200 and 300.
Troubleshooting Obtain the list of available endpoints using openstack endpoint list and verify the availability of internal Glance endpoints (URLs) from the list.
Tuning Not required
GlareApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Glare API is not accessible for the Glare endpoint in the OpenStack service catalog.
Raise condition openstack_api_check_status{name="glare"} == 0
Description Raises when the checks against all available internal Glare endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Glare are 200 and 300.
Troubleshooting Obtain the list of available endpoints using openstack endpoint list and verify the availability of internal Glare endpoints (URLs) from the list.
Tuning Not required
GlanceApiEndpointDown
Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"glance.*"} == 0
Description Raises when the check against the Glance API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the host name of the affected node. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
GlanceApiEndpointsDownMajor
Severity Major
Summary More than 50% of {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"glance.*"} == 0) >= count by(name) (http_response_status{name=~"glance.*"}) * 0.5
Description Raises when the check against the Glance API endpoint does not pass on more than 50% of the ctl nodes, typically meaning that the service endpoint is down or unreachable due to connectivity issues. For details on the affected nodes, see the host label in the GlanceApiEndpointDown alerts. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
GlanceApiEndpointsOutage
Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"glance.*"} == 0) == count by(name) (http_response_status{name=~"glance.*"})
Description Raises when the check against the Glance API endpoint does not pass on all controller nodes, typically meaning that the service endpoint is down or unreachable due to connectivity issues. For details on the affected nodes, see the host label in the GlanceApiEndpointDown alerts. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
GlanceErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Glance logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="glance"}[5m])) > 0.2
Description Raises when the average per-second rate of error, fatal or emergency messages in Glance logs on the node is more than 0.2 messages per second. The host label in the raised alert contains the affected node. Fluentd forwards all logs from Glance to Elasticsearch and counts the number of log messages by severity.
Troubleshooting Inspect the log files in /var/log/glance/ on the corresponding node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Glance error logs in Kibana and adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlanceErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="glance", \
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Heat

This section describes the alerts for Heat.


HeatApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Heat API is not accessible for all available Heat endpoints in the OpenStack service catalog.
Raise condition max(openstack_api_check_status{name=~"heat.*"}) == 0
Description Raises when the checks against all available internal Heat endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 and 300 for Heat and 200, 300, and 400 for Heat CFN. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Heat endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
HeatApiDown

Removed since the 2019.2.11 maintenance update

Severity Major
Summary Heat API is not accessible for the {{ $labels.name }} endpoint.
Raise condition openstack_api_check_status{name=~"heat.*"} == 0
Description Raises when the check against one of the available internal Heat endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 and 300 for Heat and 200, 300, and 400 for Heat CFN. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Heat endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
HeatApiEndpointDown
Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"heat.*-api"} == 0
Description Raises when the check against a Heat API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the hostname of the affected node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
HeatApiEndpointsDownMajor
Severity Major
Summary {{ $value }} {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"heat.*-api"} == 0) >= count by(name) (http_response_status{name=~"heat.*-api"}) * 0.5
Description Raises when the check against a Heat API endpoint does not pass on more than 50% of the ctl nodes, typically meaning that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the HeatApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
HeatApiEndpointsOutage
Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"heat.*-api"} == 0) == count by(name) (http_response_status{name=~"heat.*-api"})
Description Raises when the check against a Heat API endpoint does not pass on all OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the HeatApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
HeatErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Heat logs on the {{ $labels.host }} node is {{ $value }} as measured over the last 5 minutes.
Raise condition sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="heat"}[5m])) > 0.2
Description Raises when the average per-second rate of error, fatal, or emergency messages in Heat logs on the node is more than 0.2 per second. Fluentd forwards all logs from Heat to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the affected node.
Troubleshooting Inspect the log files in the /var/log/heat/ directory on the corresponding node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Heat error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            HeatErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service=""heat"", level=~""(?i:\
                (error|emergency|fatal))""}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Ironic

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the alerts for Ironic.


IronicErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Ironic logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency| fatal))"}[5m])) without (level) > 0.2
Description

Raises when the average per-second rate of error, fatal, or emergency messages in Ironic logs on the node is more than 0.2 per second, which is approximately 1 message per 5 seconds for all nodes in the cluster. Fluentd forwards all logs from Ironic to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the log files in the /var/log/ironic/ directory on the affected node.
Tuning

For example, to change the threshold to 0.1 (one error every 10 seconds for the entire cluster):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            IronicErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service=""ironic"",level=~""(?i:\
                (error|emergency|fatal))""}[5m])) without (level) > 0.1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.


IronicProcessDown
Severity Minor
Summary The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name=~"ironic-.*"} == 0
Description Raises when an Ironic process (API or conductor) on a host is down. The process_name and host labels contain the name of the affected process and the affected node.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessDownMinor
Severity Minor
Summary The {{ $labels.process_name }} process is down on 33% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.33
Description Raises when 33% or more of the Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessDownMajor
Severity Major
Summary The {{ $labels.process_name }} process is down on 66% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.66
Description Raises when 66% or more of the Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessOutage
Severity Critical
Summary The {{ $labels.process_name }} process is down on all nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) == count(procstat_running{process_name=~"ironic-.*"}) by (process_name)
Description All specified Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicDriversMissing
Severity Major
Summary The ironic-conductor {{ $labels.driver }} back-end driver is missing on {{ $value }} node(s).
Raise condition scalar(count(procstat_running{process_name=~"ironic-conductor"} == 1)) - count(openstack_ironic_driver) by (driver) > 0
Description Raises when Ironic conductors have a different number of back-end drivers enabled. The cluster performance is not affected. However, the cluster may lose HA.
Troubleshooting Inspect the Drivers panel of the Ironic Grafana dashboard for the nodes that have the disabled driver.
Tuning Not required
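
For IronicDriversMissing, a hedged way to compare the enabled drivers across conductors from the command line, assuming the Bare Metal client plugin is installed:

  # List the enabled drivers and the conductor hosts that serve each of them
  openstack baremetal driver list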

IronicApiEndpointDown
Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"ironic-api.*"} == 0
Description Raises when an Ironic API endpoint (deploy or public API) was not responding to HTTP health checks for 2 minutes. The name and host labels contain the name of the affected endpoint and the affected node.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiEndpointsDownMajor
Severity Major
Summary {{ $value }} of {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) >= count(http_response_status{name=~"ironic-api.*"}) by (name) * 0.5
Description Raises when at least 50% of Ironic API endpoints (deploy or public API) were not responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiEndpointsOutage
Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) == count(http_response_status{name=~"ironic-api.*"}) by (name)
Description Raises when all Ironic API endpoints (deploy or public API) were not responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Ironic API is not accessible for all available Ironic endpoints in the OpenStack service catalog for 2 minutes.
Raise condition max(openstack_api_check_status{service="ironic"}) == 0
Description Raises when the Ironic API or conductor service is in the DOWN state on all ctl or bmt hosts. For the exact nodes and services, inspect the host and process_name labels of the IronicProcessDown alerts.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
Keystone

This section describes the alerts for Keystone.


KeystoneApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Keystone API is not accessible for the Keystone endpoint in the OpenStack service catalog.
Raise condition openstack_api_check_status{name=~"keystone.*"} == 0
Description Raises when the checks against all available internal Keystone endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Keystone are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Keystone endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
KeystoneApiEndpointDown
Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"keystone.*"} == 0
Description Raises when the check against the Keystone API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
KeystoneApiEndpointsDownMajor
Severity Major
Summary {{ $value }} {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"keystone.*"} == 0) >= count by(name) (http_response_status{name=~"keystone.*"}) * 0.5
Description Raises when the check against a Keystone API endpoint does not pass on more than 50% of the ctl nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the KeystoneApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
KeystoneApiEndpointsOutage
Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"keystone.*"} == 0) == count by(name) (http_response_status{name=~"keystone.*"})
Description Raises when the check against a Keystone API endpoint does not pass on all OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the KeystoneApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
KeystoneErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in the Keystone logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition
  • In 2019.2.10 and prior: sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="keystone"}[5m])) > 0.2
  • In 2019.2.11 and newer: sum(rate(log_messages{service=~"keystone|keystone-wsgi",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > {{ log_threshold }}
Description Raises when the average per-second rate of error, fatal, or emergency messages in Keystone logs on the node is more than 0.2 per second. Fluentd forwards all logs from Keystone to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the log files in the /var/log/keystone/ directory on the corresponding node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Keystone error logs in Kibana. You can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            KeystoneErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="keystone", level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

KeystoneApiResponseTimeTooHigh
Severity Warning
Summary The Keystone API response time for GET and POST requests on the {{ $labels.host }} node is higher than 3 seconds for 2 minutes.
Raise condition max by(host) (openstack_http_response_times{http_method=~"^(GET|POST)$", http_status=~"^2..$", quantile="0.9", service="keystone"}) >= 3
Description Raises when the 90th percentile of the GET and POST request times to the Keystone API on a node exceeds 3 seconds for 2 minutes.
Troubleshooting Verify the performance of the OpenStack controller node.
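As a hedged starting point, you can check whether a particular controller is overloaded; the Salt targets below are examples that follow the ctl node naming convention:

  # From the Salt Master node: check the load average on the OpenStack controller nodes
  salt 'ctl*' cmd.run 'uptime'

  # List recent Keystone logs on the affected node (host label from the alert)
  salt '<host>*' cmd.run 'ls -lt /var/log/keystone/ | head'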
Tuning

For example, to change the threshold to 5 seconds:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            KeystoneApiResponseTimeTooHigh:
              if: >-
                max by(host) (openstack_http_response_times{http_method=~"^(GET|POST)$", http_status=~"^2..$", quantile="0.9", service="keystone"}) >= 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.


KeystoneKeysRotationFailure

Available since the 2019.2.11 maintenance update

Severity Major
Summary Keystone keys rotation failure.
Raise condition increase(log_messages{service="keystone-keys-rotation",level="ERROR"}[2h]) > 0
Description Raises when a Keystone user failed to rotate fernet or credential keys across the OpenStack control nodes. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the /var/log/keystone/keystone-rotate.log log on the affected node.
Tuning Not required
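
A hedged starting point for investigating this alert on the affected node; the paths assume the default Keystone layout:

  # Inspect the rotation log referenced by the alert
  tail -n 100 /var/log/keystone/keystone-rotate.log

  # Check the age and ownership of the fernet and credential key repositories
  ls -l /etc/keystone/fernet-keys/ /etc/keystone/credential-keys/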
Neutron

This section describes the alerts for Neutron.


NeutronApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Neutron API is not accessible for the Neutron endpoint in the OpenStack service catalog.
Raise condition openstack_api_check_status{name="neutron"} == 0
Description Raises when the checks against all available internal Neutron endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response code for Neutron is 200. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Neutron endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
NeutronApiEndpointDown
Severity Minor
Summary The neutron-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name="neutron-api"} == 0
Description Raises when the check against a Neutron API endpoint does not pass, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
NeutronApiEndpointsDownMajor
Severity Major
Summary More than 50% of neutron-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name="neutron-api"} == 0) >= count (http_response_status{name="neutron-api"}) * 0.5
Description Raises when the check against a Neutron API endpoint does not pass on more than 50% of OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. To identify the affected node, see the host label in the NeutronApiEndpointDown alert.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
NeutronApiEndpointsOutage
Severity Critical
Summary All available neutron-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name="neutron-api"} == 0) == count (http_response_status{name="neutron-api"})
Description Raises when the check against a Neutron API endpoint does not pass on all OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. To identify the affected node, see the host label in the NeutronApiEndpointDown alert.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
NeutronAgentDown
Severity Minor
Summary The {{ $labels.binary }} agent on the {{ $labels.hostname }} node is down.
Raise condition openstack_neutron_agent_state == 0
Description Raises when a Neutron agent is in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary and hostname labels contain the name of the agent that is in the DOWN state and the node that hosts the agent.
Troubleshooting
  • Verify the statuses of Neutron agents using openstack network agent list, as shown in the sketch after this entry.
  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
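
A hedged sketch of the first check, run from a node with admin credentials for the OpenStack client; the binary name is a placeholder taken from the alert labels:

  # List Neutron agents; the Alive column shows agents in the DOWN state
  openstack network agent list

  # Narrow the output to the agent type reported by the alert
  openstack network agent list | grep <binary>

The monitoring_remote_agent checks are the same as for the Cinder service alerts above.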
NeutronAgentsDownMinor
Severity Minor
Summary More than 30% of {{ $labels.binary }} agents are down.
Raise condition count by(binary) (openstack_neutron_agent_state == 0) >= on(binary) count by(binary) (openstack_neutron_agent_state) * 0.3
Description Raises when more than 30% of Neutron agents of the same type are in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent that is in the DOWN state.
Troubleshooting
  • Verify the statuses of Neutron agents using openstack network agent list.
  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.
  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
NeutronAgentsDownMajor
Severity Major
Summary More than 60% of {{ $labels.binary }} agents are down.
Raise condition count by(binary) (openstack_neutron_agent_state == 0) >= on(binary) count by(binary) (openstack_neutron_agent_state) * 0.6
Description Raises when more than 60% of Neutron agents of the same type are in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent that is in the DOWN state.
Troubleshooting
  • Verify the statuses of Neutron agents using openstack network agent list.
  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.
  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
NeutronAgentsOutage
Severity Critical
Summary All {{ $labels.binary }} agents are down.
Raise condition count by(binary) (openstack_neutron_agent_state == 0) == on(binary) count by(binary) (openstack_neutron_agent_state)
Description Raises when all Neutron agents of the same type are in the DOWN state and unavailable, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent that is in the DOWN state.
Troubleshooting
  • Verify the statuses of Neutron agents using openstack network agent list.
  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.
  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
NeutronErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Neutron logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="neutron"}[5m])) > 0.2
Description Raises when the average per-second rate of the error, fatal, or emergency messages in Neutron logs on the node is more than 0.2 per second. Fluentd forwards all logs from Neutron to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the Neutron logs in the /var/log/neutron/ directory on the affected node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Neutron error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NeutronErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="neutron", level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Nova

This section describes the alerts for Nova.

Nova service

This section describes the Nova API and services alerts.


NovaApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Nova API is not accessible for all available Nova endpoints in the OpenStack service catalog.
Raise condition max(openstack_api_check_status{name=~"nova.*|placement"}) == 0
Description Raises when the checks against all available internal Nova or placement endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 for nova and nova20, 200 and 401 for placement. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting
  • Verify the states of Nova endpoints from the output of openstack endpoint list.
  • Inspect the NovaApiDown alert for the nodes and services that are in the DOWN state.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaApiDown

Removed since the 2019.2.11 maintenance update

Severity Major
Summary Nova API is not accessible for the {{ $labels.name }} endpoint.
Raise condition openstack_api_check_status{name=~"nova.*|placement"} == 0
Description Raises when the check against one of the available internal Nova or placement endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 for nova and nova20, and 200 and 401 for placement. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting
  • Verify the states of Nova endpoints from the output of openstack endpoint list.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaApiEndpointDown
Severity Minor
Summary The nova-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"nova-api"} == 0
Description Raises when the check against a Nova API endpoint does not pass on a node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.
  • Verify the configured URL availability using curl.
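
  For example, a minimal verification sketch; <nova_api_url> is a placeholder for the URL taken from the Telegraf configuration file:

    # Show the URL and the expected response code configured for the check:
    grep -A 3 'nova-api' /etc/telegraf/telegraf.d/input-http_response.conf
    # Query the URL manually and print only the HTTP response code:
    curl -s -o /dev/null -w '%{http_code}\n' <nova_api_url>
    # Inspect recent Telegraf messages related to the HTTP response input:
    journalctl -u telegraf --since "30 min ago" | grep -i http_response
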
Tuning Not required
NovaApiEndpointsDownMajor
Severity Major
Summary {{ $value }} nova-api endpoints (>= 0.5 * 100%) are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"nova-api"} == 0) >= count(http_response_status{name=~"nova-api"}) * 0.5
Description Raises when the check against a Nova API endpoint does not pass on more than 50% of OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected 200 response code and actual HTTP response codes. For details, see HTTP response input plugin.
Troubleshooting
  • Inspect the NovaApiEndpointDown alert for the nodes and services that are in the DOWN state.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.
  • Verify the configured URL availability using curl.
Tuning Not required
NovaApiEndpointsOutage
Severity Critical
Summary All available nova-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"nova-api"} == 0) == count(http_response_status{name=~"nova-api"})
Description Raises when the check against a Nova API endpoint does not pass on all OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the NovaApiEndpointDown alert for the nodes and services that are in the DOWN state.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.
  • Verify the configured URL availability using curl.
Tuning Not required
NovaServiceDown
Severity Minor
Summary The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.
Raise condition openstack_nova_service_state == 0
Description Raises when the Nova controller or compute service (os-service) is in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview. The binary and hostname labels in the raised alert contain the service name that is in the DOWN state and the affected node name.
Troubleshooting
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
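
  For example, a minimal verification sketch; nova-conductor is used only as a sample binary, substitute the binary reported in the alert:

    # List Nova services and their states, optionally filtered by binary:
    openstack compute service list
    openstack compute service list --service nova-conductor
    # On the affected node, inspect the corresponding systemd unit and its logs:
    systemctl status nova-conductor
    journalctl -u nova-conductor --since "1 hour ago"
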
Tuning Not required
NovaServicesDownMinor
Severity Minor
Summary {{ $value }} {{ $labels.binary }} services (>=0.3 * 100%) are down.
Raise condition count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.3
Description Raises when more than 30% of Nova controller services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting
  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaComputeServicesDownMinor
Severity Minor
Summary {{ $value }} nova-compute services (>= 0.25 * 100%) are down.
Raise condition count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.25
Description Raises when more than 25% of Nova compute services are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting
  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaServicesDownMajor
Severity Major
Summary {{ $value }} {{ $labels.binary }} services (>= 0.6 * 100%) are down.
Raise condition count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.6
Description Raises when more than 60% of Nova controller services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting
  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaComputeServicesDownMajor
Severity Major
Summary {{ $value }} nova-compute services (>= 0.5 * 100%) are down.
Raise condition count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.5
Description Raises when more than 50% of Nova compute services are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting
  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaServiceOutage
Severity Critical
Summary All {{ $labels.binary }} services are down.
Raise condition count(openstack_nova_service_state == 0) by (binary) == on (binary) count(openstack_nova_service_state) by (binary)
Description Raises when all Nova controller or compute services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview. The binary and hostname labels in the raised alert contain the service name that is in the DOWN state and the affected node name.
Troubleshooting
  • Verify the states of Nova services from the output of the openstack compute service list command.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Not required
NovaErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in the Nova logs on the {{ $labels.host }} node is more than 0.2 messages per second (as measured over the last 5 minutes).
Raise condition sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency| fatal))"}[5m])) without (level) > 0.2
Description Raises when the average per-second rate of the error, fatal, or emergency messages in Nova logs on the node is more than 0.2 per second. Fluentd forwards all logs from Nova to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the log files in the /var/log/nova/ directory of the affected node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Nova error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="nova", level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NovaComputeSystemLoadTooHighWarning

Available starting from the 2019.2.9 maintenance update

Severity Warning
Summary The system load per CPU on the {{ $labels.host }} node is more than 1 for 5 minutes.
Raise condition system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.0
Description Raises when the average load on an OpenStack compute node is higher than 1 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the output of the uptime and top commands on the affected node.
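
  For example, a minimal sketch to compare the load average with the number of CPU cores on the affected node:

    uptime                  # the last three values are the 1-, 5-, and 15-minute load averages
    nproc                   # number of available CPU cores
    top -b -n 1 | head -20  # snapshot of the processes consuming the most CPU
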
Tuning

For example, to change the threshold to 1.5 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaComputeSystemLoadTooHighWarning:
              if: >-
                system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NovaComputeSystemLoadTooHighCritical

Available starting from the 2019.2.9 maintenance update

Severity Critical
Summary The system load per CPU on the {{ $labels.host }} node is more than 2 for 5 minutes.
Raise condition system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 2.0
Description Raises when the average load on an OpenStack compute node is higher than 2 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the output of the uptime and top commands on the affected node.
Tuning

For example, to change the threshold to 3 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaComputeSystemLoadTooHighCritical:
              if: >-
                system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 3
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Nova resources

This section describes the alerts for Nova resources consumption.

Warning

The following set of alerts has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable these alerts as described in Manage alerts.


NovaHypervisorVCPUsFullMinor

Removed since the 2019.2.4 maintenance update

Severity Minor
Summary {{ $value }} VCPUs on the {{ $labels.hostname }} node (>= {{ cpu_minor_threshold * 100 }}%) are used.
Raise condition label_replace(system_load15, "hostname", "$1", "host", "(.*)") > on (hostname) openstack_nova_vcpus * 0.85
Description Raises when the hypervisor consumes more than 85% of the available VCPU (the average load for 15 minutes), according to the data from Nova API and the load average from /proc/loadavg on the appropriate node. For details, see Hypervisors. The hostname label in the raised alert contains the affected node name.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
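
  For example, a minimal verification sketch; cmp001 is a sample hypervisor host name, substitute the node from the alert:

    openstack hypervisor list
    openstack hypervisor show cmp001 -c vcpus -c vcpus_used -c memory_mb -c free_ram_mb
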
Tuning Disable the alert as described in Manage alerts.
NovaHypervisorVCPUsFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }} VCPUs on the {{ $labels.hostname }} node (>= {{ cpu_major_threshold * 100 }}%) are used.
Raise condition label_replace(system_load15, "hostname", "$1", "host", "(.*)") > on (hostname) openstack_nova_vcpus * 0.95
Description Raises when the hypervisor consumes more than 95% of the available VCPU (the average load for 15 minutes), according to the data from Nova API and the load average from /proc/loadavg on the appropriate node. For details, see Hypervisors. The hostname label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaHypervisorMemoryFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}MB of RAM on the {{ $labels.hostname }} node (>= {{ ram_major_threshold * 100 }}%) is used.
Raise condition openstack_nova_used_ram > openstack_nova_ram * 0.85
Description Raises when the hypervisor allocates more than 85% of the available RAM, according to the data from Nova API. For details, see Hypervisors. The hostname label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaHypervisorMemoryFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}MB of RAM on the {{ $labels.hostname }} node (>= {{ ram_critical_threshold * 100 }}%) is used.
Raise condition openstack_nova_used_ram > openstack_nova_ram * 0.95
Description Raises when the hypervisor allocates more than 95% of the available RAM, according to the data from Nova API. For details, see Hypervisors. The hostname label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaHypervisorDiskFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}GB of disk space on the {{ $labels.hostname }} node (>= {{ disk_major_threshold * 100 }}%) is used.
Raise condition openstack_nova_used_disk > openstack_nova_disk * 0.85
Description Raises when the hypervisor allocates more than 85% of the available disk space, according to the data from Nova API. For details, see Hypervisors. The hostname label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaHypervisorDiskFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}GB of disk space on the {{ $labels.hostname }} node (>= {{ disk_critical_threshold *100 }}%) is used.
Raise condition openstack_nova_used_disk > openstack_nova_disk * 0.95
Description Raises when the hypervisor allocates more than 95% of the available disk space, according to the data from Nova API. For details, see Hypervisors. The hostname label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaAggregateMemoryFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}MB of RAM on the {{ $labels.aggregate }} aggregate is used (at least {{ ram_major_threshold * 100}}%).
Raise condition openstack_nova_aggregate_used_ram > openstack_nova_aggregate_ram * 0.85
Description Raises when the RAM allocation over all hypervisors within a host aggregate is more than 85% of the total available RAM, according to the data from Nova API. For details, see Hypervisors and Host aggregates. The aggregate label in the raised alert contains the name of the affected host aggregate.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the list of host aggregate members using the openstack aggregate list and openstack aggregate show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaAggregateMemoryFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}MB of RAM on the {{ $labels.aggregate }} aggregate (>= {{ ram_critical_threshold * 100 }}%) is used.
Raise condition openstack_nova_aggregate_used_ram > openstack_nova_aggregate_ram * 0.95
Description Raises when the RAM allocation over all hypervisors within a host aggregate is more than 95% of the total available RAM, according to the data from Nova API. For details, see Hypervisors and Host aggregates. The aggregate label in the raised alert contains the name of the affected host aggregate.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the list of host aggregate members using the openstack aggregate list and openstack aggregate show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaAggregateDiskFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}GB of disk space on the {{ $labels.aggregate }} aggregate (>= {{ disk_major_threshold *100 }}%) is used.
Raise condition openstack_nova_aggregate_used_disk > openstack_nova_aggregate_disk * 0.85
Description Raises when the disk space allocation over all hypervisors within a host aggregate is more than 85% of the total available disk space, according to the data from Nova API. For details, see Hypervisors and Host aggregates. The aggregate label in the raised alert contains the name of the affected host aggregate.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the list of host aggregate members using the openstack aggregate list and openstack aggregate show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaAggregateDiskFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}GB of disk space on the {{ $labels.aggregate }} aggregate (>= {{ disk_critical_threshold *100 }}%) is used.
Raise condition openstack_nova_aggregate_used_disk > openstack_nova_aggregate_disk * 0.95
Description Raises when the disk space allocation over all hypervisors within a host aggregate is more than 95% of the total available disk space, according to the data from Nova API. For details, see Hypervisors and Host aggregates. The aggregate label in the raised alert contains the name of the affected host aggregate.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the list of host aggregate members using the openstack aggregate list and openstack aggregate show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalVCPUsFullMinor

Removed since the 2019.2.4 maintenance update

Severity Minor
Summary {{ $value }} VCPUs in the cloud (>= {{ cpu_minor_threshold * 100 }}%) are used.
Raise condition sum(label_replace(system_load15, "hostname", "$1", "host", "(.*)") and on (hostname) openstack_nova_vcpus) > max(sum(openstack_nova_vcpus) by (instance)) * 0.85
Description Raises when the VCPU consumption over all hypervisors (the average load for 15 minutes) is more than 85% of the total available VCPU, according to the data from Nova API and /proc/loadavg on the appropriate node. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalVCPUsFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }} VCPUs in the cloud (>= {{cpu_major_threshold * 100 }}%) are used.
Raise condition sum(label_replace(system_load15, "hostname", "$1", "host", "(.*)") and on (hostname) openstack_nova_vcpus) > max(sum(openstack_nova_vcpus) by (instance)) * 0.95
Description Raises when the VCPU consumption over all hypervisors (the average load for 15 minutes) is more than 95% of the total available VCPU, according to the data from Nova API and /proc/loadavg on the appropriate node. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalMemoryFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}MB of RAM in the cloud (>= {{ram_major_threshold * 100}}%) is used.
Raise condition openstack_nova_total_used_ram > openstack_nova_total_ram * 0.85
Description Raises when the RAM allocation over all hypervisors is more than 85% of the total available RAM, according to the data from Nova API. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalMemoryFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}MB of RAM in the cloud (>= {{ram_critical_threshold * 100 }}%) is used.
Raise condition openstack_nova_total_used_ram > openstack_nova_total_ram * 0.95
Description Raises when the RAM allocation over all hypervisors is more than 95% of the total available RAM, according to the data from Nova API. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalDiskFullMajor

Removed since the 2019.2.4 maintenance update

Severity Major
Summary {{ $value }}GB of disk space in the cloud (>= {{disk_major_threshold * 100 }}%) is used.
Raise condition openstack_nova_total_used_disk > openstack_nova_total_disk * 0.85
Description Raises when the disk space allocation over all hypervisors is more than 85% of the total disk space, according to the data from Nova API. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
NovaTotalDiskFullCritical

Removed since the 2019.2.4 maintenance update

Severity Critical
Summary {{ $value }}GB of disk space in the cloud (>= {{disk_critical_threshold * 100 }}%) is used.
Raise condition openstack_nova_total_used_disk > openstack_nova_total_disk * 0.95
Description Raises when the disk space allocation over all hypervisors is more than 95% of the total disk space, according to the data from Nova API. For details, see Hypervisors.
Troubleshooting
  • Verify the hypervisor capacity using the openstack hypervisor list or openstack hypervisor show commands.
  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.
  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
Tuning Disable the alert as described in Manage alerts.
Octavia

This section describes the alerts for Octavia.


OctaviaApiDown

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Octavia API is not accessible for all available Octavia endpoints in the OpenStack service catalog for 2 minutes.
Raise condition max(openstack_api_check_status{service="octavia-api"}) == 0
Description Raises when the checks against all available internal Octavia endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response code for Octavia is 200. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Octavia endpoints (URLs) from the output of the openstack endpoint list command.
Tuning Not required
OctaviaErrorLogsTooHigh
Severity Warning
Summary The average per-second rate of errors in Octavia logs on the {{ $labels.host }} node is more than 0.2 error messages per second (as measured over the last 5 minutes).
Raise condition sum(rate(log_messages{service="octavia",level=~"error|emergency| fatal"}[5m])) without (level) > 0.2
Description Raises when the average per-second rate of the error, fatal, or emergency messages in the Octavia logs on the node is more than 0.2 per second. Fluentd forwards all logs from Octavia to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the log files in the /var/log/octavia/ directory on the affected node.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Octavia error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            OctaviaErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="octavia", level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Kubernetes

This section describes the alerts available for a Kubernetes cluster.

Calico

This section describes the alerts for Calico.

CalicoProcessDown
Severity Minor
Summary The Calico {{ $labels.process_name }} process on the {{ $labels.host }} node is down for 2 minutes.
Raise condition procstat_running{process_name=~"calico-felix|bird|bird6|confd"} == 0
Description Raises when Telegraf cannot find running processes with names calico-felix, bird, bird6, confd on any ctl host. The process_name and host labels in the raised alert contain the name of a particular process and the host name of the affected node respectively.
Troubleshooting
  • Inspect dmesg and /var/log/kern.log
  • Inspect the logs in /var/log/calico
  • Inspect the output of the systemctl status containerd and journalctl -u containerd commands
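
  For example, a minimal verification sketch, run on the affected node:

    # Verify whether the Calico processes are running:
    pgrep -a -f 'calico-felix|bird6?|confd'
    # Inspect the most recently updated Calico logs and the kernel messages:
    ls -lt /var/log/calico/ | head
    dmesg -T | tail -50
    # Check the container runtime that hosts the Calico containers:
    systemctl status containerd --no-pager
    journalctl -u containerd --since "30 min ago" | tail -50
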
Tuning Not required
CalicoProcessDownMinor
Severity Minor
Summary More than 30% of Calico {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"} == 0) by (process_name) > count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"}) by (process_name) * {{ instance_minor_threshold_percent }}
Description Raises when Telegraf cannot find running processes with names calico-felix, bird, bird6, confd on more than 30% of the ctl hosts. The process_name label in the raised alert contains the name of a particular process.
Troubleshooting
  • Inspect the CalicoProcessDown alerts for the host names of the affected nodes
  • Inspect dmesg and /var/log/kern.log
  • Inspect the logs in /var/log/calico
  • Inspect the output of the systemctl status containerd and journalctl -u containerd commands
Tuning Not required
CalicoProcessDownMajor
Severity Major
Summary More than 60% of Calico {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"} == 0) by (process_name) > count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"}) by (process_name) * {{ instance_major_threshold_percent }}
Description Raises when Telegraf cannot find running processes with names calico-felix, bird, bird6, confd on more than 60% of ctl hosts. The process_name label in the raised alert contains the name of a particular process.
Troubleshooting
  • Inspect the CalicoProcessDown alerts for host names of the affected nodes
  • Inspect dmesg and /var/log/kern.log
  • Inspect the logs in /var/log/calico
  • Inspect the output of the systemctl status containerd and journalctl -u containerd commands
Tuning Not required
CalicoProcessOutage
Severity Critical
Summary All Calico {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"}) by (process_name) == count(procstat_running{process_name=~"calico-felix|bird|bird6|confd"} == 0) by (process_name)
Description Raises when Telegraf cannot find running processes with names calico-felix, bird, bird6, confd on all ctl hosts. The process_name label in the raised alert contains the name of a particular process.
Troubleshooting
  • Verify the CalicoProcessDown alerts for host names of the affected nodes
  • Inspect dmesg and /var/log/kern.log
  • Inspect the logs in /var/log/calico
  • Inspect the output of the systemctl status containerd and journalctl -u containerd commands
Tuning Not required
etcd

This section describes the alerts for the etcd service.


EtcdRequestFailureTooHigh
Severity Minor
Summary More than 1% of HTTP {{ $labels.method }} requests to the etcd API failed on the {{ $labels.instance }} instance.
Raise condition sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.01
Description Raises when the percentage of failed HTTP requests from the client to etcd is higher than 1%, calculated per request method. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the etcd service status on the affected node using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            EtcdRequestFailureTooHigh:
              if: >-
                sum by(method) (rate(etcd_http_failed_total[5m])) / sum\
                by(method) (rate(etcd_http_received_total[5m])) > 0.02
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

EtcdInstanceNoLeader
Severity Major
Summary The etcd {{ $labels.instance }} instance has no leader.
Raise condition etcd_server_has_leader != 1
Description Raises when the etcd server reports that it has no leader.
Troubleshooting

Verify all etcd services on the ctl nodes:

  • Verify the etcd service status using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
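
  For example, a minimal verification sketch, run on a ctl node; the etcdctl commands assume that the etcd v2 client is installed and configured:

    systemctl status etcd --no-pager
    journalctl -u etcd --since "30 min ago" | tail -50
    etcdctl cluster-health   # health of each cluster member
    etcdctl member list      # members and the current leader
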
Tuning Not required
EtcdServiceDownMinor
Severity Minor
Summary The etcd {{ $labels.instance }} instance is down for 2 minutes.
Raise condition up{job='etcd'} == 0
Description Raises when Prometheus fails to scrape the etcd target. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required
EtcdServiceDownMajor
Severity Major
Summary More than 30% of etcd instances are down for 2 minutes.
Raise condition count(up{job='etcd'} == 0) > count(up{job='etcd'}) * 0.3
Description Raises when Prometheus fails to scrape more than 30% of etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required
EtcdServiceOutage
Severity Critical
Summary All etcd services within the cluster are down.
Raise condition count(up{job='etcd'} == 0) == count(up{job='etcd'})
Description Raises when Prometheus fails to scrape all etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required
Kubernetes

This section describes the alerts for Kubernetes.


ContainerScrapeError
Severity Warning
Summary Prometheus was not able to scrape metrics from the container on the {{ $labels.instance }} Kubernetes instance.
Raise condition container_scrape_error != 0
Description Raises when cadvisor fails to scrape metrics from a container.
Tuning Not required
KubernetesProcessDown
Severity Minor
Summary The Kubernetes {{ $labels.process_name }} process on the {{ $labels.host }} node is down for 2 minutes.
Raise condition procstat_running{process_name=~"hyperkube-.*"} == 0
Description Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, and hyperkube-scheduler processes on any ctl host and hyperkube-kubelet or hyperkube-proxy processes on any cmp host. The process_name label in the raised alert contains the process name.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues on the cmp nodes, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
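
  For example, a minimal verification sketch, run on the affected node:

    systemctl status containerd --no-pager
    systemctl status docker --no-pager
    systemctl status criproxy --no-pager   # cmp nodes only
    tail -n 100 /var/log/kubernetes.log
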
Tuning Not required
KubernetesProcessDownMinor
Severity Minor
Summary {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_minor_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_minor_threshold_percent }}
Description Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, and hyperkube-scheduler processes on more than 30% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues on the cmp nodes, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required
KubernetesProcessDownMajor
Severity Major
Summary {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_major_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_major_threshold_percent }}
Description Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, and hyperkube-scheduler processes on more than 60% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues on the cmp nodes, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required
KubernetesProcessOutage
Severity Critical
Summary All Kubernetes {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) == count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name)
Description Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, and hyperkube-scheduler processes on all ctl and cmp hosts. The process_name label in the raised alert contains the process name.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues on the cmp nodes, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required
OpenContrail

This section describes the alerts available for OpenContrail.

Cassandra

This section describes the alerts for the Cassandra service.


CassandraServiceDown
Severity Minor
Summary The Cassandra service on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="cassandra-server"} == 0
Description Raises when Telegraf cannot find running cassandra-server processes on the ntw or nal hosts. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the Cassandra logs in the /var/log/cassandra/ directory on the affected node.
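
  For example, a minimal verification sketch, run on the affected ntw or nal node; the system.log file name and the availability of nodetool are assumptions:

    pgrep -a -f cassandra-server || echo "cassandra-server is not running"
    tail -n 100 /var/log/cassandra/system.log
    nodetool status   # if available, shows the state of the Cassandra cluster members
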
Tuning Not required
CassandraServiceDownMinor
Severity Minor
Summary More than 30% of Cassandra services are down.
Raise condition count(procstat_running{process_name="cassandra-server"} == 0) >= count(procstat_running{process_name="cassandra-server"}) *{{ monitoring.services_failed_warning_threshold_percent }}
Description Raises when Telegraf cannot find running cassandra-server processes on more than 30% of ntw and nal hosts.
Troubleshooting
  • Inspect the CassandraServiceDown alert for the host names of the affected nodes.
  • Inspect the Cassandra logs in /var/log/cassandra/.
Tuning Not required
CassandraServiceDownMajor
Severity Major
Summary More than 60% of Cassandra services are down.
Raise condition count(procstat_running{process_name="cassandra-server"} == 0) >= count(procstat_running{process_name="cassandra-server"}) *{{ monitoring.services_failed_critical_threshold_percent }}
Description Raises when Telegraf cannot find running cassandra-server processes on more than 60% of ntw and nal hosts.
Troubleshooting
  • Inspect the CassandraServiceDown alert for the host names of the affected nodes.
  • Inspect the Cassandra logs in /var/log/cassandra/.
Tuning Not required
CassandraServiceOutage
Severity Critical
Summary All Cassandra services are down.
Raise condition count(procstat_running{process_name="cassandra-server"} == 0) == count(procstat_running{process_name="cassandra-server"})
Description Raises when Telegraf cannot find running cassandra-server processes on all ntw and nal hosts.
Troubleshooting
  • Inspect the CassandraServiceDown alert for the host names of the affected nodes.
  • Inspect the Cassandra logs in /var/log/cassandra/.
Tuning Not required
Kafka

This section describes the alerts for the Kafka service.


KafkaServiceDown
Severity Minor
Summary The Kafka service on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="kafka-server"} == 0
Description Raises when Kafka on a particular host does not respond to Telegraf, typically indicating that Kafka is unavailable on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Kafka status by running systemctl status kafka on the affected node.
  • Inspect the Kafka logs on the affected node using journalctl -u kafka.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
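
  For example, a minimal verification sketch, run on the affected node:

    systemctl status kafka --no-pager
    journalctl -u kafka --since "1 hour ago" | tail -50
    journalctl -u telegraf --since "1 hour ago" | grep -i kafka
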
Tuning Not required
KafkaServiceDownMinor
Severity Minor
Summary {{ $value }} Kafka services are down (at least {{monitoring.services_failed_warning_threshold_percent*100}}%).
Raise condition count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.3
Description Raises when more than 30% of the Kafka services in the cluster are unavailable.
Troubleshooting
  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.
  • Inspect the Kafka logs on the affected node using journalctl -u kafka.
Tuning Not required
KafkaServiceDownMajor
Severity Major
Summary {{ $value }} Kafka services are down (at least {{monitoring.services_failed_critical_threshold_percent*100}}%).
Raise condition count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.6
Description Raises when more than 60% of the Kafka services in the cluster are unavailable.
Troubleshooting
  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.
  • Inspect the Kafka logs on the affected node using journalctl -u kafka.
Tuning Not required
KafkaServiceOutage
Severity Critical
Summary All Kafka services are down.
Raise condition count(procstat_running{process_name="kafka-server"} == 0) == count(procstat_running{process_name="kafka-server"})
Description Raises when all Kafka services across the cluster do not respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting If Kafka is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
OpenContrail

This section describes the general alerts for OpenContrail, such as the API, processes, instance, and health check alerts.


ContrailApiDown
Severity Minor
Summary The {{ $labels.name }} API endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"contrail.*"} == 0
Description Raises when the HTTP check for the OpenContrail API endpoint is failing. The host and name labels in the raised alert contain the host name of the affected node and the service name.
Troubleshooting
  • Before debugging OpenContrail, inspect the Neutron API, Keystone, and Neutron server alerts, if any.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
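
  For example, a minimal verification sketch; contrail-api is used as a sample service name, and <pid> and <portnum> are placeholders for the actual values:

    systemctl status contrail-api --no-pager
    ls -lt /var/log/contrail/ | head
    # If the process is running but unresponsive:
    strace -p <pid> -e trace=network
    lsof -p <pid>
    tcpdump -nei any port <portnum> -A -s 1500
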
Tuning Not required
ContrailApiDownMinor
Severity Minor
Summary More than 30% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.3
Description Raises when more than 30% of the OpenContrail API HTTP checks fail. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailApiDownMajor
Severity Major
Summary More than 60% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.6
Description Raises when more than 60% of the OpenContrail API HTTP checks fail. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailApiOutage
Severity Critical
Summary The {{ $labels.name }} API is not accessible for all available endpoints for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) == count(http_response_status{name=~"contrail.*"}) by (name)
Description Raises when the HTTP checks fail for all OpenContrail API endpoints. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailProcessDown
Severity Minor
Summary The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name=~"contrail.*"} == 0
Description Raises when the OpenContrail process is down on one node. The host and process_name labels in the raised alert contain the host name of the affected node and the process name.
Troubleshooting
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailProcessDownMinor
Severity Minor
Summary More than 30% of {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.3 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when more than 30% of the OpenContrail processes of the same name are in the DOWN state. The process_name label in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailProcessDownMajor
Severity Major
Summary More than 60% of {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.6 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when more than 60% of the OpenContrail processes of the same name are in the DOWN state. The process_name label in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailProcessOutage
Severity Critical
Summary All {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) == count(procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when an OpenContrail process is in the DOWN state on all nodes, indicating that the process is unavailable. The process_name in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
ContrailMetadataCheck

Available starting from the 2019.2.3 maintenance update

Severity Critical
Summary The OpenContrail metadata on the {{ $labels.host }} node is unavailable for 15 minutes.
Raise condition min(exec_contrail_instance_metadata_present) by (host) == 0
Description Raises when the curl 127.0.0.1:8085/Snh_LinkLocalServiceInfo | grep -c 169.254.169.254 HTTP query returns 0, indicating that the OpenContrail metadata service is unavailable on the affected node.
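
For example, to reproduce the check manually on the affected node (the same query that the Telegraf check runs; 8085 is the introspect port of the vRouter agent):

  curl -s 127.0.0.1:8085/Snh_LinkLocalServiceInfo | grep -c 169.254.169.254

An output of 0 corresponds to the alert condition.
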
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Configure > Infrastructure > Link local services and verify that metadata is configured.
  3. Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
ContrailHealthCheckDisabled

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail health check is disabled.
Raise condition absent(contrail_health_exit_code) == 1
Description Raises when the metric from the contrail-status check script is absent.
Troubleshooting Inspect the Telegraf logs on the ntw nodes.
Tuning Not required
ContrailHealthCheckFailed

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail health check failed for the {{ $labels.contrail_service }} service on the {{ $labels.host }} node.
Raise condition contrail_health_exit_code != 0
Description Raises when any contrail service from the output of contrail-status is inactive. The contrail_service label in the raised alert contains the affected service name.
Troubleshooting Inspect the affected service, for example, using contrail-status and systemctl status <service_name> on the affected node.
Tuning Not required
OpencontrailInstancePingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity Major
Summary The OpenContrail instance ping check on the {{ $labels.host }} node is down for 2 minutes.
Raise condition instance_ping_check_up == 0
Description Raises when the OpenContrail instance ping check on a node is down for 2 minutes. The host label in the raised alert contains the affected node name.
Tuning Not required
OpenContrail flows

This section describes the alerts for OpenContrail flows.

Warning

The following set of alerts has been removed starting from the 2019.2.3 maintenance update. For the existing MCP deployments, disable these alerts as described in Manage alerts.


ContrailFlowsActiveTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary More than 100 OpenContrail vRouter flows per second on the {{ $labels.host }} node are active for 2 minutes.
Raise condition deriv(contrail_vrouter_flows_active[5m]) >= 100
Description Raises when the number of active OpenContrail vRouter flows, measured over 5 minutes, grows at a rate of 100 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsDiscardedTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary The average per-second rate of discarded OpenContrail vRouter flows on the {{ $labels.host }} node is more than 0.1 for 2 minutes.
Raise condition rate(contrail_vrouter_flows_discard[5m]) >= 0.1
Description Raises when the number of discarded OpenContrail vRouter flows, measured over 5 minutes, grows at a rate of 0.1 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsDroppedTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary The average per-second rate of dropped OpenContrail vRouter flows on the {{ $labels.host }} node is more than 0.2 for 2 minutes.
Raise condition rate(contrail_vrouter_flows_flow_action_drop[5m]) >= 0.2
Description Raises when the number of dropped OpenContrail vRouter flows, measured over 5 minutes, grows at a rate of 0.2 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsFragErrTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary More than 100 OpenContrail vRouter flows had fragment errors on the {{$labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_flows_frag_err) by (host) >= 100
Description Raises when the number of OpenContrail vRouter flows with fragment errors reaches 100 on one node.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsNextHopInvalidTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary The average per-second rate of OpenContrail vRouter flows with an invalid next hop on the {{ $labels.host }} node is more than 0.1 for 2 minutes.
Raise condition rate(contrail_vrouter_flows_invalid_nh[5m]) >= 0.1
Description Raises when the number of OpenContrail vRouter flows with an invalid next hop, measured over 5 minutes, grows at a rate of 0.1 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsInterfaceInvalidTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary The average per-second rate of OpenContrail vRouter flows with an invalid composite interface on the {{ $labels.host}} node is more than 0.05 for 2 minutes.
Raise condition rate(contrail_vrouter_flows_composite_invalid_interface[5m]) >= 0.05
Description Raises when the number of OpenContrail vRouter flows with invalid composite interfaces, measured over 5 minutes, grows at a rate of 0.05 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsLabelInvalidTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary More than 100 OpenContrail vRouter flows had an invalid label on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_flows_invalid_label) by (host) >= 100
Description Raises when the number of OpenContrail vRouter flows with an invalid label (specifying next hop) reaches 100 by default.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsQueueSizeExceededTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary The average per-second rate of OpenContrail vRouter flows exceeding the queue size on the {{ $labels.host }} node is more than 0.1 for 2 minutes.
Raise condition rate(contrail_vrouter_flows_flow_queue_limit_exceeded[5m]) >= 0.1
Description Raises when the number of queue size exceeded errors, measured over 5 minutes, grows at a rate of 0.1 or more per second.
Tuning Disable the alert as described in Manage alerts.
ContrailFlowsTableFullTooHigh

Removed since the 2019.2.3 maintenance update

Severity Warning
Summary More than 100 OpenContrail vRouter flows had a full table on the {{$labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_flows_flow_table_full) by (host) >= 100
Description Raises when the number of OpenContrail vRouter flow table full errors reaches 100 on one node.
Tuning Disable the alert as described in Manage alerts.
OpenContrail vRouter

This section describes the alerts for the OpenContrail vRouter.


ContrailBGPSessionsNoEstablished
Severity Warning
Summary There are no established OpenContrail BGP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition max(contrail_bgp_session_count) by (host) == 0
Description Raises when no BGP sessions in the established state (FSM) exist on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the bgp_peer module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailBGPSessionsNoActive
Severity Warning
Summary There are no active OpenContrail BGP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition max(contrail_bgp_session_up_count) by (host) == 0
Description Raises when no BGP sessions in the active state (FSM) exist on a node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the bgp_peer module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailBGPSessionsDown
Severity Warning
Summary The OpenContrail BGP sessions on the {{ $labels.host }} node are down for 2 minutes.
Raise condition min(contrail_bgp_session_down_count) by (host) > 0
Description Raises when a node has BGP sessions in the down state. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the bgp_peer module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailXMPPSessionsMissingEstablished
Severity Warning
Summary The OpenContrail XMPP sessions in the established state are missing on the compute cluster for 2 minutes.
Raise condition count(contrail_vrouter_xmpp) * 2 - sum(contrail_xmpp_session_up_count) > 0
Description Raises when the compute cluster is missing OpenContrail XMPP sessions in the established state (FSM), that is, the total number of established sessions is lower than two per vRouter. The alert does not assume an equal distribution of sessions across the cluster: a single vRouter can have zero sessions in the working state, but a properly operating compute cluster must have at least two connections per vRouter.
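
To see how the established XMPP sessions are distributed, you can, for example, compare the expected and actual totals in the Prometheus web UI (a sketch built from the metrics used in the raise condition):

  count(contrail_vrouter_xmpp) * 2                # expected number of established sessions
  sum(contrail_xmpp_session_up_count) by (host)   # established sessions reported per node
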
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailXMPPSessionsMissing
Severity Warning
Summary The OpenContrail XMPP sessions are missing on the compute cluster for 2 minutes.
Raise condition count(contrail_vrouter_xmpp) * 2 - sum(contrail_xmpp_session_count) > 0
Description Raises when the compute cluster is missing OpenContrail XMPP sessions in any state. The conditions are the same as for the ContrailXMPPSessionsMissingEstablished alert.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailXMPPSessionsDown
Severity Warning
Summary The {{ $labels.host }} node contains the OpenContrail XMPP sessions in the down state for 2 minutes.
Raise condition min(contrail_xmpp_session_down_count) by (host) > 0
Description Raises when a node has OpenContrail XMPP sessions in the DOWN state. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailXMPPSessionsTooHigh
Severity Warning
Summary There are more than 500 open OpenContrail XMPP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_xmpp_session_count) by (host) >= 500
Description

Raises when the number of OpenContrail XMPP sessions reaches 500 on one node. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to 1000 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailXMPPSessionsTooHigh:
              if: >-
                min(contrail_xmpp_session_count) by (host) >= 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailXMPPSessionsChangesTooHigh
Severity Warning
Summary The OpenContrail XMPP sessions on the {{ $labels.host }} node have changed more than 100 times.
Raise condition abs(delta(contrail_xmpp_session_count[2m])) >= 100
Description

Raises when the number of OpenContrail XMPP session changes reaches 100 on one node, calculated as an absolute difference of the first and last point in a two-minute time frame. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to >= 250 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailXMPPSessionsChangesTooHigh:
              if: >-
                abs(delta(contrail_xmpp_session_count[2m])) >= 250
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterXMPPSessionsZero
Severity Warning
Summary There are no OpenContrail vRouter XMPP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_xmpp) by (host) == 0
Description Raises when a node has no OpenContrail vRouter XMPP sessions. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailVrouterXMPPSessionsTooHigh
Severity Warning
Summary There are more than 10 open OpenContrail vRouter XMPP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_xmpp) by (host) >= 10
Description

Raises when the number of OpenContrail vRouter XMPP sessions reaches 10 on one node. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to 20 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterXMPPSessionsTooHigh:
              if: >-
                min(contrail_vrouter_xmpp) by (host) >= 20
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterXMPPSessionsChangesTooHigh
Severity Warning
Summary The OpenContrail vRouter XMPP sessions on the {{$labels.host }} node have changed more than 5 times.
Raise condition abs(delta(contrail_vrouter_xmpp[2m])) >= 5
Description

Raises when the number of OpenContrail vRouter XMPP session changes reaches 5 on one node, calculated as an absolute difference of the first and last points in a two-minute time frame. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > Infrastructure > Control Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to 10 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterXMPPSessionsChangesTooHigh:
              if: >-
                abs(delta(contrail_vrouter_xmpp[2m])) >= 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterDNSXMPPSessionsZero
Severity Warning
Summary There are no OpenContrail vRouter DNS-XMPP sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_dns_xmpp) by (host) == 0
Description Raises when one node has no OpenContrail DNS-XMPP sessions. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > DNS Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning Not required
ContrailVrouterDNSXMPPSessionsTooHigh
Severity Warning
Summary More than 10 OpenContrail vRouter DNS-XMPP sessions are open on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_dns_xmpp) by (host) >= 10
Description

Raises when the number of OpenContrail DNS-XMPP sessions reaches 10 on one node. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > DNS Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to 20 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterDNSXMPPSessionsTooHigh:
              if: >-
                min(contrail_vrouter_dns_xmpp) by (host) >= 20
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterDNSXMPPSessionsChangesTooHigh
Severity Warning
Summary The OpenContrail vRouter DNS-XMPP sessions on the {{ $labels.host }} node have changed more than 5 times.
Raise condition abs(delta(contrail_vrouter_dns_xmpp[2m])) >= 5
Description

Raises when the number of OpenContrail DNS-XMPP session changes reaches 5 on one node, calculated as an absolute difference of the first and last points in a two-minute time frame.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Monitor > DNS Nodes and select the affected node to inspect the analytics data of the OpenContrail controller nodes.
  3. In Introspect, inspect the introspection data filtered by request type. Select the xmpp_server module.
  4. Verify the BGP routers configuration in Configure > Infrastructure > BGP Routers.
Tuning

For example, to change the threshold to 10 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterDNSXMPPSessionsChangesTooHigh:
              if: >-
                abs(delta(contrail_vrouter_dns_xmpp[2m])) >= 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterLLSSessionsTooHigh
Severity Warning
Summary There are more than 10 open OpenContrail vRouter LLS sessions on the {{ $labels.host }} node for 2 minutes.
Raise condition min(contrail_vrouter_lls) by (host) >= 10
Description

Raises when the number of OpenContrail vRouter link local service (LLS) sessions reaches 10 on one node.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 20 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterLLSSessionsTooHigh:
              if: >-
                min(contrail_vrouter_lls) by (host) >= 20
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailVrouterLLSSessionsChangesTooHigh
Severity Warning
Summary The OpenContrail vRouter LLS sessions on the {{$labels.host }} node have changed more than 5 times.
Raise condition abs(delta(contrail_vrouter_lls[2m])) >= 5
Description

Raises when the number of OpenContrail vRouter LLS session changes reaches 5 on one node, calculated as an absolute difference of the first and last points in a two-minute time frame.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 10 sessions:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ContrailVrouterLLSSessionsChangesTooHigh:
              if: >-
                abs(delta(contrail_vrouter_lls[2m])) >= 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ContrailGlobalVrouterConfigCheckDisabled

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail global vRouter configuration check is disabled.
Raise condition absent(contrail_global_vrouter_config_exit_code) == 1
Description Raises when Prometheus has no metric with the contrail_global_vrouter_config_exit_code name.
Troubleshooting Inspect the Telegraf logs on the ntw nodes.
Tuning Not required
ContrailGlobalVrouterConfigCheckFailed

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail global vRouter configuration check failed on the {{ $labels.host }} node.
Raise condition contrail_global_vrouter_config_exit_code != 0
Description Raises when the OpenContrail Virtual Network Controller (VNC) API returns zero or more than one global-vrouter-config objects.
Troubleshooting Inspect the output of the contrail-status command on any ntw node.
Tuning Not required
Redis

This section describes the alerts for the Redis service.


RedisServiceDown
Severity Minor
Summary The Redis service on the {{$labels.host}} node is down.
Raise condition procstat_running{process_name="redis-server"} == 0
Description Raises when Telegraf cannot find running redis-server processes on a node, typically indicating excessive memory consumption on the node, the Redis port being used by another process, or wrong permissions set on the Redis configuration or log files. The host label in the raised alert contains the host name of the affected node.
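
For example, to verify locally whether Redis responds and which process, if any, holds its port (a minimal sketch assuming the default Redis port 6379; adjust to the port configured in your model):

  redis-cli -p 6379 ping      # a healthy server replies PONG
  ss -tlnp | grep 6379        # shows the process listening on the Redis port
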
Troubleshooting
  • Verify the redis-server service status using systemctl status redis-server.
  • Inspect the redis-server service logs in /var/log/redis/redis-server.log.
Tuning Not required
RedisServiceDownMinor
Severity Minor
Summary More than 30% of Redis services are down.
Raise condition count(procstat_running{process_name="redis-server"} == 0) >= count (procstat_running{process_name="redis-server"}) * 0.3
Description Raises when Telegraf cannot find running redis-server processes on more than 30% (the default threshold) of the ntw and nal hosts.
Troubleshooting
  • Inspect the RedisServiceDown alerts for the host names of the affected nodes.
  • Verify the redis-server service status using systemctl status redis-server.
  • Inspect the redis-server service logs in /var/log/redis/redis-server.log.
Tuning Not required
RedisServiceDownMajor
Severity Major
Summary More than 60% of Redis services are down.
Raise condition count(procstat_running{process_name="redis-server"} == 0) >= count (procstat_running{process_name="redis-server"}) * 0.6
Description Raises when Telegraf cannot find running redis-server processes on more than 60% (the default threshold) of the ntw and nal hosts.
Troubleshooting
  • Inspect the RedisServiceDown alerts for the host names of the affected nodes.
  • Verify the redis-server service status using systemctl status redis-server.
  • Inspect the redis-server service logs in /var/log/redis/redis-server.log.
Tuning Not required
RedisServiceOutage
Severity Critical
Summary All Redis services are down.
Raise condition count(procstat_running{process_name="redis-server"} == 0) == count (procstat_running{process_name="redis-server"})
Description Raises when Telegraf cannot find running redis-server processes on all ntw and nal hosts.
Troubleshooting
  • Inspect the RedisServiceDown alerts for the host names of the affected nodes.
  • Verify the redis-server service status using systemctl status redis-server.
  • Inspect the redis-server service logs in /var/log/redis/redis-server.log.
Tuning Not required
ZooKeeper

This section describes the alerts for the ZooKeeper service.


ZookeeperServiceDown
Severity Minor
Summary The ZooKeeper service on the {{ $labels.host }} node is down for 2 minutes.
Raise condition zookeeper_up == 0
Description Raises when the ZooKeeper service on a host node does not respond to Telegraf, typically indicating that ZooKeeper is down on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the ZooKeeper status on the affected node using service zookeeper status.
  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
ZookeeperServiceErrorWarning
Severity Warning
Summary The ZooKeeper service on the {{ $labels.host }} node is not responding for 2 minutes.
Raise condition zookeeper_service_health == 0
Description Raises when the ZooKeeper service on a node is not healthy (in operational mode), typically indicating that the service is unresponsive due to a high load or an operating system or hardware issue on the node.
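
For example, a minimal manual health check using the ZooKeeper four-letter-word commands (a sketch assuming the default client port 2181 and that these commands are not restricted in your ZooKeeper configuration):

  echo ruok | nc localhost 2181    # a healthy server replies imok
  echo mntr | nc localhost 2181    # prints server statistics such as latency and outstanding requests
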
Troubleshooting
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/zookeeper.
Tuning Not required
ZookeeperServicesDownMinor
Severity Minor
Summary More than 30% of ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) >= count(zookeeper_up) * 0.3
Description Raises when a ZooKeeper cluster has more than 30% of unavailable services.
Troubleshooting Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
Tuning Not required
ZookeeperServicesDownMajor
Severity Major
Summary More than 60% of ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) >= count(zookeeper_up) * 0.6
Description Raises when a ZooKeeper cluster has more than 60% of unavailable services.
Troubleshooting Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
Tuning Not required
ZookeeperServiceOutage
Severity Critical
Summary All ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) == count(zookeeper_up)
Description Raises when all ZooKeeper services across a cluster do not respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting
  • Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor
Severity Minor
Summary The Ceph cluster is in the WARNING state. For details, run ceph -s.
Raise condition ceph_health_status == 1
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
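
For example, ceph health detail expands the WARNING state into the individual health checks that caused it:

  ceph -s
  ceph health detail
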
Tuning Not required
CephClusterHealthCritical
Severity Critical
Summary The Ceph cluster is in the CRITICAL state. For details, run ceph -s.
Raise condition ceph_health_status == 2
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
Tuning Not required
CephMonitorDownMinor
Severity Minor
Summary {{ $value }}% of Ceph Monitors are down. For details, run ceph -s.
Raise condition count(ceph_mon_quorum_status) - sum(ceph_mon_quorum_status) > 0
Description Raises if any of the Ceph Monitors in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-mon.<hostname>.log logs on the affected cmn node.
Tuning Not required
CephOsdDownMinor
Severity Minor
Summary {{ $value }}% of Ceph OSDs are down. For details, run ceph osd tree.
Raise condition count(ceph_osd_up) - sum(ceph_osd_up) > 0
Description Raises if any of the Ceph OSD nodes in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-osd.<hostname>.log logs on the affected osd node.
Tuning Not required
CephOsdSpaceUsageWarning
Severity Warning
Summary {{ $value }} bytes of the Ceph OSD space (>=75%) are used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 75%.
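
For example, to see how the used space is distributed across the OSDs before deciding whether to remove data or add OSDs:

  ceph df
  ceph osd df tree
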
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageWarning:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdSpaceUsageMajor
Severity Major
Summary {{ $value }} bytes of the Ceph OSD space (>=85%) are used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
Tuning

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageMajor:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageWarning
Severity Warning
Summary The Ceph {{pool_name}} pool uses 75% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 75%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
Tuning

Should be tuned per pool. For example, to change the threshold to 80% for pool volumes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageWarning:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * \
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageCritical
Severity Critical
Summary The Ceph {{pool_name}} pool uses 85% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
Tuning

Should be tuned per pool. For example, to change the threshold to 90% for pool volumes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageCritical:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * \
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighWarning
Severity Warning
Summary Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 200
Description Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 200.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
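
As a rough sanity check, the commonly used rule of thumb (a sketch only, not a substitute for the official Ceph placement group calculator) is to target a total PG count per pool of roughly the number of OSDs multiplied by the target PGs per OSD, divided by the pool replica size, and rounded to a power of two:

  # Hypothetical example: 12 OSDs, target of 100 PGs per OSD, replica size 3
  echo $(( 12 * 100 / 3 ))              # ~400; round to 256 or 512 depending on expected growth
  ceph osd pool get <pool_name> pg_num  # inspect the current pg_num of a pool
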
Tuning

For example, to change the threshold to 400 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighWarning:
              if: >-
                max(ceph_osd_numpg) > 400
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighCritical
Severity Critical
Summary Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 300
Description Raises when the number of PGs on Ceph OSDs is bigger than the default threshold of 300.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
Tuning

For example, to change the threshold to 500 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighCritical:
              if: >-
                max(ceph_osd_numpg) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Note

Ceph prediction alerts have been added starting from the MCP 2019.2.3 update and should be enabled manually. For details, see Enable the Ceph Prometheus plugin.

CephPredictOsdIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400)  > {{osd_iops_limit}}
Description

Predicts the IOPS consumption per Ceph OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.

Warning

For production environments, configure osd_iops_limit after deployment depending on the used hardware. For exemplary estimates for different hardware types, see IOPS.
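
With the default prediction window of one week and, for instance, osd_iops_limit set to 200, the rendered raise condition looks as follows (an illustrative substitution of the template above, not an additional alert):

  predict_linear(ceph_osd_op:rate5m[7d], 7 * 86400) > 200
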

Tuning

For example, to change osd_iops_limit to 200:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
     _param:
       osd_iops_limit: 200
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400)  > avg_over_time(ceph_osd_op:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.

Warning

For production environments, configure osd_iops_threshold after deployment depending on the current cluster load and estimated limits from CephPredictOsdIOPSthreshold.

Tuning

For example, to change osd_iops_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        osd_iops_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictUsageRAM

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.host}} host may run out of available RAM next week.
Raise condition predict_linear(mem_free{host=~"cmn.*|rgw.*|osd.*"}[{{threshold}}d], {{threshold}} * 86400) < 0
Description Predicts the exhaustion of the available RAM on Ceph nodes in a defined time range.
Tuning Not required
CephPredictOsdWriteLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_write_time:rate5m {host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_write_time:rate5m[1d]) * {{write_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the write latency. The threshold parameter defines the time range. The write_latency_threshold parameter defines the differences to detect in the write latency.

Warning

For production environments, configure write_latency_threshold after deployment.

Tuning

For example, to change write_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        write_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdReadLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_read_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"} [{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_read_time:rate5m[1d]) * {{read_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the read latency. The threshold parameter defines the time range. The read_latency_threshold parameter defines the differences to detect in the read latency.

Warning

For production environments, configure read_latency_threshold after deployment.

Tuning

For example, to change read_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        read_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolSpace

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{pool_name}} pool may consume more than {{100*space_threshold}}% of the available capacity in 1 week. For details, run ceph df and plan proper actions.
Raise condition predict_linear(ceph_pool_bytes_used[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > (ceph_pool_bytes_used + ceph_pool_max_avail) * {{space_threshold}} * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"}
Description

Predicts the exhaustion of all available capacity of a pool in a defined time range. The threshold parameter specifies the time range to use. The space_threshold parameter defines the capacity threshold, similar to the one set in CephPool{pool_name}SpaceUsageCritical.

Warning

For production environments, configure space_threshold after deployment.

Tuning

For example, to change space_threshold to 85:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        space_threshold: 85
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{ iops_limit }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_limit to osd_iops_limit from CephPredictOsdIOPSthreshold multiplied by the number of OSDs for this pool.

Tuning

For example, to change pool_iops_limit to 2000:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_limit: 2000
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > avg_over_time(ceph_pool_ops:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_threshold to osd_iops_threshold from CephPredictOsdIOPSauto multiplied by the number of OSDs connected to each pool.

Tuning

For example, to change pool_iops_threshold to 3:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_threshold: 3
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

RadosGWOutage

Available only in the 2019.2.10 maintenance update

Severity Critical
Summary RADOS Gateway outage.
Raise condition max(openstack_api_check_status{name=~"radosgw.*"}) == 0
Description Raises if RADOS Gateway is not accessible for all available RADOS Gateway endpoints in the OpenStack service catalog.
Tuning Not required
RadosGWDown

Available only in the 2019.2.10 maintenance update

Severity Major
Summary The {{ $labels.name }} endpoint is not accessible.
Raise condition openstack_api_check_status{name=~"radosgw.*"} == 0
Description Raises if RADOS Gateway is not accessible for the {{ $labels.name }} endpoint.
Tuning Not required
StackLight LMA

This section describes the alerts available for the StackLight LMA services.

Alertmanager

This section describes the alerts for the Alertmanager service.


AlertmanagerNotificationFailureWarning
Severity Warning
Summary Alertmanager has {{ $value }} failed notifications for {{$labels.integration }} on the {{ $labels.instance}} instance for 2 minutes.
Raise condition increase(alertmanager_notifications_failed_total[2m]) > 0
Description

Raises when Alertmanager fails to send a notification to a specific receiver or channel, for example, for one of the following reasons:

  • Slack: wrong API key or channel name
  • Email: authentication issues or the SMTP server is not available
  • Webhook: the server is not available
Troubleshooting Run docker service logs monitoring_alertmanager on any mon node and inspect the Alertmanager logs.
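
For example, to narrow the log stream down to notification errors (a sketch; the exact log wording depends on the Alertmanager version):

  docker service logs monitoring_alertmanager 2>&1 | grep -iE 'notify|error'
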
Tuning Not required
AlertmanagerAlertsInvalidWarning
Severity Warning
Summary An average of {{ $value }} Alertmanager {{$labels.integration }} alerts on the {{ $labels.instance }} instance are invalid for 2 minutes.
Raise condition increase(alertmanager_alerts_invalid_total[2m]) > 0
Description

Raises when Alertmanager receives an alert with errors, for example:

  • Missing start and end time
  • Empty labels
  • Wrong labels or annotations. The key or value must not be empty, must start with a letter, and must be alphanumeric.
Troubleshooting Run docker service logs monitoring_alertmanager on any mon node and inspect the Alertmanager logs.
Tuning Not required
Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticsearchClusterHealthStatusMajor
Severity Major
Summary The Elasticsearch cluster status is YELLOW for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 2
Description Raises when the Elasticsearch cluster status is YELLOW for 2 minutes, meaning that Elasticsearch has allocated all of the primary shards but some or all of the replicas have not been allocated. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host is elasticsearch:client:server:host and port is elasticsearch:client:server:port defined in your model.
Troubleshooting
  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API.

  • If UNASSIGNED shards are present, reallocate the shards by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' -H 'Content-Type: \
    application/json'  -d' { "persistent": \
    {"cluster.routing.allocation.enable": "all" }}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute.

Tuning Not required
ElasticsearchClusterHealthStatusCritical
Severity Critical
Summary The Elasticsearch cluster status is RED for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 3
Description Raises when the Elasticsearch cluster status is RED for 2 minutes, meaning that some or all of primary shards are not ready. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host is elasticsearch:client:server:host and port is elasticsearch:client:server:port defined in your model.
Troubleshooting
  • Verify that the Elasticsearch service is running on all log nodes using service elasticsearch status.

  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API.

  • Enable shard allocation by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' -H 'Content-Type: \
    application/json' -d' { "persistent": \
    {"cluster.routing.allocation.enable": "all" }}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute.

For more troubleshooting details, see Official Elasticsearch documentation.

Tuning Not required
ElasticsearchServiceDown
Severity Minor
Summary The Elasticsearch service on the {{ $labels.host }} node is down.
Raise condition elasticsearch_up{host=~".*"} == 0
Description Raises when the Elasticsearch service is down on a log node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the status of the service by running systemctl status elasticsearch on the affected node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required
ElasticsearchServiceDownMinor
Severity Minor
Summary 30% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.3
Description Raises when the Elasticsearch service is down on more than 30% of the log nodes. By default, 3 log nodes are present, meaning that the service is down on one node.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required
ElasticsearchServiceDownMajor
Severity Major
Summary 60% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.6
Description Raises when the Elasticsearch service is down on more than 60% of the log nodes. By default, 3 log nodes are present, meaning that the service is down on two nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required
ElasticsearchServiceOutage
Severity Critical
Summary All Elasticsearch services within the cluster are down.
Raise condition count(elasticsearch_up{host=~".*"} == 0) == count(elasticsearch_up{host=~".*"})
Description Raises when the Elasticsearch service is down on all log nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required
ElasticsearchDiskWaterMarkMinor
Severity Minor
Summary The Elasticsearch {{ $labels.instance }} instance uses 60% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.6
Description Raises when the Elasticsearch instance uses 60% of disk space on the log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus Web UI, use the raise condition query to view the graph for a longer period of time and define the best threshold.

For example, change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMinor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance)\
                 - max(elasticsearch_fs_total_available_in_bytes) by \
                 (host, instance)) / max(elasticsearch_fs_total_total_in_bytes)\
                  by (host, instance) >= 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ElasticsearchDiskWaterMarkMajor
Severity Major
Summary The Elasticsearch {{ $labels.instance }} instance uses 75% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.75
Description Raises when the Elasticsearch instance uses 75% of disk space on the log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus Web UI, use the raise condition query to view the graph for a longer period of time and define the best threshold.

For example, change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMajor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance)\
                 - max(elasticsearch_fs_total_available_in_bytes) by \
                 (host, instance)) / max(elasticsearch_fs_total_total_in_bytes)\
                  by (host, instance) >= 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ElasticsearchExporterNoDailyLogs

Available since 2019.2.10

Severity Warning
Summary No new logs sent from a node within the last 3 hours.
Raise condition (sum by (host) (changes(logs_program_host_doc_count[3h])) or sum by (host) (up{host!=""})*0) == 0
Description Raises when no new logs were shipped from the {{ $labels.host }} node within the last 3 hours.
Troubleshooting Verify that Fluentd is operating properly on the affected node.
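
For example, a minimal check of Fluentd on the affected node, assuming the default td-agent service name used by StackLight LMA:

    # Verify that the Fluentd (td-agent) service is running
    systemctl status td-agent

    # Inspect the recent Fluentd log entries for errors
    journalctl -u td-agent --since "3 hours ago"
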
Tuning Not required
Heka

This section describes the alerts for the Heka service.

HekaOutputQueueStalled
Severity Warning
Summary The {{ $labels.queue }} queue is stalled on node {{$labels.host }} for more than 1 hour. The corresponding Heka service is either down or stuck.
Raise condition heka_output_queue_size > 134217728
Description Raises when Heka freezes and the output queue is larger than 134217728 (128 MB). The host label in the raised alert contains the name of the affected node.
Troubleshooting Restart the corresponding Heka log collector using service log_collector restart.
Tuning Not required
InfluxDB

This section describes the alerts for InfluxDB, InfluxDB Relay, and remote storage adapter.

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.


InfluxdbServiceDown
Severity Minor
Summary The InfluxDB service on the {{ $labels.host }} node is down.
Raise condition influxdb_up == 0
Description Raises when the InfluxDB service on one of the mtr nodes is down. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.
  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
  • Verify the available disk space using df -h.
Tuning Not required
InfluxdbServicesDownMinor
Severity Minor
Summary More than 30% of InfluxDB services are down.
Raise condition count(influxdb_up == 0) >= count(influxdb_up) * 0.3
Description Raises when InfluxDB services are down on more than 30% of mtr nodes.
Troubleshooting
  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.
  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.
  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
  • Verify the available disk space using df -h.
Tuning Not required
InfluxdbServicesDownMajor
Severity Major
Summary More than 60% of InfluxDB services are down.
Raise condition count(influxdb_up == 0) >= count(influxdb_up) * 0.6
Description Raises when InfluxDB services are down on more than 60% of the mtr nodes.
Troubleshooting
  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.
  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.
  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
  • Verify the available disk space using df -h.
Tuning Not required
InfluxdbServiceOutage
Severity Critical
Summary All InfluxDB services are down.
Raise condition count(influxdb_up == 0) == count(influxdb_up)
Description Raises when InfluxDB services are down on all mtr nodes.
Troubleshooting
  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.
  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.
  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
  • Verify the available disk space using df -h.
Tuning Not required
InfluxdbSeriesMaxNumberWarning
Severity Warning
Summary The InfluxDB database contains 950000 time series.
Raise condition influxdb_database_numSeries >= 950000
Description Raises when the number of series collected by InfluxDB reaches the threshold of 95%. InfluxDB continues collecting the data. However, reaching the maximum series threshold is critical.
Troubleshooting
  • Decrease the retention policy for the affected database.
  • Remove unused data.
  • Increase the maximum number of series to keep in the database.
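
For example, to inspect and shorten a retention policy from an mtr node using the influx CLI. The lma database name and the autogen policy name below are placeholders only; list the actual databases and policies first:

    # List the databases and the retention policies of the affected database
    influx -execute 'SHOW DATABASES'
    influx -execute 'SHOW RETENTION POLICIES ON "lma"'

    # Shorten the retention period of the default policy to 30 days
    influx -execute 'ALTER RETENTION POLICY "autogen" ON "lma" DURATION 30d'
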
Tuning

Typically, you should not change the default value. If the alert is constantly firing, increase the max_series_per_database parameter value tenfold.

For example, to change the threshold to 9 500 000 and the number of series to 10 000 000:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbSeriesMaxNumberWarning:
              if: >-
                influxdb_database_numSeries >= 9500000
    
  3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:

    parameters:
      influxdb:
        server:
          data:
            max_series_per_database: 10000000
    
  4. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    salt -C 'I@influxdb:server' influxdb.server
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbSeriesMaxNumberCritical
Severity Critical
Summary The InfluxDB database contains 1000000 time series. No more series can be saved.
Raise condition influxdb_database_numSeries >= 1000000
Description

Raises when the number of series collected by InfluxDB reaches the critical threshold of 1M series. InfluxDB is available but cannot collect more data. Any write request to the database ends with the HTTP 500 status code and the max series per database exceeded error message. It is not possible to determine which data has not been recorded.

Warning

For production environments, after deployment, set both the threshold and the max_series_per_database parameter value to 10 000 000.

Troubleshooting
  • Decrease the retention policy for the affected database.
  • Remove unused data.
  • Increase the maximum number of series to keep in the database.
Tuning

For example, to change the number of series and the threshold to 10 000 000:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbSeriesMaxNumberCritical:
              if: >-
                influxdb_database_numSeries >= 10000000
    
  3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:

    parameters:
      influxdb:
        server:
          data:
            max_series_per_database: 10000000
    
  4. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    salt -C 'I@influxdb:server' influxdb.server
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPClientErrorsWarning
Severity Warning
Summary An average of 5% of HTTP client requests on the {{ $labels.host }} node fail.
Raise condition rate(influxdb_httpd_clientError[1m]) / rate(influxdb_httpd_req[1m]) * 100 > 5
Description Raises when the rate of client error HTTP requests reaches the threshold of 5%, indicating issues with the request format, service performance, or the maximum number of series being reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect InfluxDB logs on the affected node using journalctl -xfu influxdb.
  • Verify if the InfluxdbSeriesMaxNumberWarning is firing.
Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPClientErrorsWarning:
              if: >-
                rate(influxdb_httpd_clientError[1m]) / \
                rate(influxdb_httpd_req[1m]) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPPointsWritesFailWarning
Severity Warning
Summary More than 5% of HTTP points writes on the {{ $labels.host }} node fail.
Raise condition rate(influxdb_httpd_pointsWrittenFail[1m]) / (rate(influxdb_httpd_pointsWrittenOK[1m]) + rate(influxdb_httpd_pointsWrittenFail[1m]) + rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description Raises when the percentage of failed HTTP points writes reaches the threshold of 5%, indicating a nonexistent database or that the maximum series threshold has been reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
  • Verify if the InfluxdbSeriesMaxNumberWarning is firing.
Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPPointsWritesFailWarning:
              if: >-
                rate(influxdb_httpd_pointsWrittenFail[1m]) / \
                (rate(influxdb_httpd_pointsWrittenOK[1m]) + \
                rate(influxdb_httpd_pointsWrittenFail[1m]) + \
                rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPPointsWritesDropWarning
Severity Warning
Summary More than 5% of HTTP points writes on the {{ $labels.host }} node were dropped.
Raise condition rate(influxdb_httpd_pointsWrittenDropped[1m]) / (rate(influxdb_httpd_pointsWrittenOK[1m]) + rate(influxdb_httpd_pointsWrittenFail[1m]) + rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description Raises when the percentage of dropped HTTP points writes reaches the threshold of 5%. Dropping of measurements must be a controlled operation, determined by the retention policy or manual actions. This alert is expected during maintenance. Otherwise, investigate the reasons. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPPointsWritesDropWarning:
              if: >-
                rate(influxdb_httpd_pointsWrittenDropped[1m]) /
                (rate(influxdb_httpd_pointsWrittenOK[1m]) + \
                rate(influxdb_httpd_pointsWrittenFail[1m]) + \
                rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbRelayBufferFullWarning
Severity Warning
Summary The InfluxDB Relay {{ $labels.host }} back-end buffer is 80% full.
Raise condition influxdb_relay_backend_buffer_bytes / 5.36870912e+08 * 100 > 80
Description Raises when the InfluxDB Relay summarized buffer usage reaches 80% of the buffer size set to 512 MB, which may indicate issues with InfluxDB. When the buffer is full, the requests cannot be cached.
Troubleshooting Increase the buffer size as required.
Tuning

For example, to change the threshold to 90% and the buffer size to 1024 MB:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbRelayBufferFullWarning:
              if: >-
                influxdb_relay_backend_buffer_bytes / 2^20  > 1024 * 0.9
    
  3. In the _params section in cluster/<cluster_name>/stacklight/telemetry.yml, specify:

    influxdb_relay_buffer_size_mb: 1024
    
  4. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    salt -C 'I@influxdb:server' state.sls influxdb.relay
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbRelayRequestsFailWarning
Severity Warning
Summary An average of 5% of InfluxDB Relay requests on the {{ $labels.host }} node fail.
Raise condition rate(influxdb_relay_failed_requests_total[1m]) / rate(influxdb_relay_requests_total[1m]) * 100 > 5
Description Raises when the percentage of InfluxDB Relay failed requests reaches the threshold of 5%, indicating issues with the InfluxDB Relay back end availability.
Troubleshooting
  • Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
  • Inspect the InfluxdbRelayBufferFullWarning alert.
  • Inspect the InfluxdbSeriesMaxNumberWarning or InfluxdbSeriesMaxNumberCritical alerts.
Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbRelayRequestsFailWarning:
              if: >-
                rate(influxdb_relay_failed_requests_total[1m]) / \
                rate(influxdb_relay_requests_total[1m]) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RemoteStorageAdapterMetricsSendingWarning
Severity Warning
Summary The remote storage adapter metrics on sent to received ratio on the {{ $labels.instance }} instance is less than 0.9.
Raise condition increase(sent_samples_total{job="remote_storage_adapter"}[1m]) / on (job, instance) increase(received_samples_total[1m]) < 0.9
Description Raises when the sent to received metrics ratio of the remote storage adapter falls below 0.9 (90%). If this ratio decreases, the adapter stops sending new metrics to the remote storage.
Troubleshooting
  • Verify that the remote storage adapter container is operating by running docker ps on the mon nodes.
  • Inspect the remote storage service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.
Tuning

For example, to change the evaluation window to 10 minutes and the threshold to 1:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RemoteStorageAdapterMetricsSendingWarning:
              if: >-
                increase(sent_samples_total{job="remote_storage_adapter"}[10m])\
                 / on (job, instance) increase(received_samples_total[10m]) < 1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RemoteStorageAdapterMetricsIgnoredWarning
Severity Warning
Summary More than 5% of remote storage adapter metrics on the {{ $labels.instance }} instance are invalid.
Raise condition increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) / on (job, instance) increase(sent_samples_total[1m]) >= 0.05
Description Raises when the ignored to sent metrics ratio of the remote storage adapter reaches the default 5%, indicating that at least 5% of the metrics sent from the remote storage adapter were ignored by InfluxDB.
Troubleshooting
  • Inspect the InfluxDB alerts.
  • Inspect the remote storage service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.
Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RemoteStorageAdapterMetricsIgnoredWarning:
              if: >-
                increase(prometheus_influxdb_ignored_samples_total\
                {job="remote_storage_adapter"}[1m]) / on (job, instance) \
                increase(sent_samples_total[1m]) >= 0.1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Kibana

This section describes the alerts for the Kibana service.


KibanaProcessDown
Severity Minor
Summary The Kibana process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="kibana"} == 0
Description Raises when Telegraf cannot find a running kibana process, typically indicating that the Kibana process is down on one node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Kibana service status on the affected node using systemctl status kibana.
  • Inspect the Kibana service logs using journalctl -xfu kibana.
Tuning Not required
KibanaProcessesDownMinor
Severity Minor
Summary More than 30% of Kibana processes are down.
Raise condition count(procstat_running{process_name="kibana"} == 0) >= count(procstat_running{process_name="kibana"}) * 0.3
Description Raises when Telegraf cannot find running kibana processes on more than 30% of the log hosts.
Troubleshooting
  • Inspect the KibanaProcessDown alerts for the host names of the affected nodes.
  • Verify the Kibana service status on the affected node using systemctl status kibana.
  • Inspect the Kibana service logs using journalctl -xfu kibana.
Tuning Not required
KibanaProcessesDownMajor
Severity Major
Summary More than 60% of Kibana processes are down.
Raise condition count(procstat_running{process_name="kibana"} == 0) >= count(procstat_running{process_name="kibana"}) * 0.6
Description Raises when Telegraf cannot find running kibana processes on more than 60% of the log hosts.
Troubleshooting
  • Inspect the KibanaProcessDown alerts for the host names of the affected nodes.
  • Verify the Kibana service status on the affected node using systemctl status kibana.
  • Inspect the Kibana service logs using journalctl -xfu kibana.
Tuning Not required
KibanaServiceOutage
Severity Critical
Summary All Kibana processes are down.
Raise condition count(procstat_running{process_name="kibana"} == 0) == count(procstat_running{process_name="kibana"})
Description Raises when Telegraf cannot find running kibana processes on all the log hosts.
Troubleshooting
  • Inspect the KibanaProcessDown alerts for the host names of the affected nodes.
  • Verify the Kibana service status on the affected node using systemctl status kibana.
  • Inspect the Kibana service logs using journalctl -xfu kibana.
Tuning Not required
MongoDB

This section describes the MongoDB alerts.


MongoDBServiceDown
Severity Minor
Summary The MongoDB service on the {{ $labels.host }} node is down for 1 minute.
Raise condition mongodb_up == 0
Description Raises when the MongoDB process is in the DOWN state on a host. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the MongoDB logs using journalctl -u mongodb on the affected node or in /var/log.
Tuning Not required
MongoDBServiceOutage
Severity Critical
Summary All MongoDB services are down for 1 minute.
Raise condition count(mongodb_up == 0) == count(mongodb_up)
Description Raises when the MongoDB processes are in the DOWN state on all nodes.
Troubleshooting
  • Inspect the MongoDBServiceDown alert for the affected node.
  • Inspect the MongoDB logs using journalctl -u mongodb on the affected node or in /var/log.
Tuning Not required
MongoDBNoPrimaryMember
Severity Critical
Summary MongoDB cluster has no primary member for 1 minute.
Raise condition absent(mongodb_state == 1)
Description Raises when the MongoDB cluster has no primary member.
Troubleshooting Inspect the MongoDB logs using journalctl -u mongodb on a node with MongoDB or in /var/log.
Tuning Not required
Prometheus

This section describes the alerts for the Prometheus service.


PrometheusTargetDown
Severity Critical
Summary The Prometheus target for the {{ $labels.job }} job on the {{ $labels.host or $labels.instance }} node is down for 2 minutes.
Raise condition up != 1
Description Raises when Prometheus fails to scrape a target for 2 minutes. The reasons depend on the target type. For example, Telegraf-related or connectivity issues, Fluentd misconfiguration, issues with libvirt or JMX Exporter.
Troubleshooting

Depending on the target type:

  • Inspect the Telegraf logs using journalctl -u telegraf.
  • Inspect the Fluentd logs using journalctl -u td-agent.
  • Inspect the libvirt-exporter logs using journalctl -u libvirt-exporter.
  • Inspect the jmx-exporter logs using journalctl -u jmx-exporter.
Tuning Not required
PrometheusTargetSamplesOrderWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{$labels.instance }} instance are out of order (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0
Description

Raises when Prometheus observes out-of-order time series samples.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.
PrometheusTargetSamplesBoundsWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{$labels.instance }} instance have time stamps out of bounds (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0
Description

Raises when Prometheus observes samples with time stamps greater than the current time.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.
PrometheusTargetSamplesDuplicateWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{$labels.instance }} instance have duplicate time stamps (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0
Description

Raises when Prometheus observes samples with duplicated time stamps.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.
PrometheusDataIngestionWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary The Prometheus service writes on the {{$labels.instance }} instance do not keep up with the data ingestion speed for 10 minutes.
Raise condition prometheus_local_storage_rushed_mode != 0
Description

Raises when the Prometheus service writes do not keep up with the data ingestion speed for 10 minutes.

Warning

The alert is deprecated for Prometheus versions newer than 1.7 and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.
PrometheusRemoteStorageQueueFullWarning
Severity Warning
Summary The Prometheus remote storage queue on the {{$labels.instance }} instance is 75% full for 2 minutes.
Raise condition prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100 > 75
Description Raises when Prometheus remote write queue is 75% full.
Troubleshooting Inspect the remote write service configuration in the remote_write section in /srv/volumes/local/prometheus/config/prometheus.yml.
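
For example, to display the remote_write section of the Prometheus configuration on a mon node:

    grep -A 20 'remote_write' /srv/volumes/local/prometheus/config/prometheus.yml
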
Tuning

For example, to change the warning threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PrometheusRemoteStorageQueueFullWarning:
              if: >-
                prometheus_remote_storage_queue_length / \
                prometheus_remote_storage_queue_capacity * 100 > 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PrometheusRelayServiceDown
Severity Minor
Summary The Prometheus Relay service on the {{$labels.host}} node is down for 2 minutes.
Raise condition procstat_running{process_name="prometheus-relay"} == 0
Description Raises when Telegraf cannot find running prometheus-relay processes on an mtr host. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
Tuning Not required
PrometheusRelayServiceDownMajor
Severity Major
Summary More than 50% of Prometheus Relay services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus-relay"} == 0) >= count (procstat_running{process_name="prometheus-relay"}) * 0.5
Description Raises when Telegraf cannot find running prometheus-relay processes on more than 50% of the mtr hosts.
Troubleshooting
  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
Tuning Not required
PrometheusRelayServiceOutage
Severity Critical
Summary All Prometheus Relay services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus-relay"} == 0) == count (procstat_running{process_name="prometheus-relay"})
Description Raises when Telegraf cannot find running prometheus-relay processes on all mtr hosts.
Troubleshooting
  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
Tuning Not required
PrometheusLTSServiceDown
Severity Minor
Summary The Prometheus long-term storage service on the {{$labels.host}} node is down for 2 minutes.
Raise condition procstat_running{process_name="prometheus"} == 0
Description Raises when Telegraf cannot find running prometheus processes on an mtr host. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
Tuning Not required
PrometheusLTSServiceDownMajor
Severity Major
Summary More than 50% of the Prometheus long-term storage services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus"} == 0) >= count (procstat_running{process_name="prometheus"}) * 0.5
Description Raises when Telegraf cannot find running prometheus processes on more than 50% of the mtr hosts.
Troubleshooting
  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
Tuning Not required
PrometheusLTSServiceOutage
Severity Critical
Summary All Prometheus long-term storage services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus"} == 0) == count (procstat_running{process_name="prometheus"})
Description Raises when Telegraf cannot find running prometheus processes on all mtr hosts.
Troubleshooting
  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
Tuning Not required
PrometheusRuleEvaluationsFailed

Available starting from the 2019.2.7 maintenance update

Severity Warning
Summary The Prometheus server for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus web UI.
Raise condition rate(prometheus_rule_evaluation_failures_total[5m]) > 0
Description Raises when the number of evaluation failures of Prometheus recording rules continuously increases for 10 minutes. The issue typically occurs once you reload the Prometheus service after configuration changes.
Troubleshooting Verify the syntax and metrics in the recently added custom Prometheus recording rules in the cluster model.
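
For example, to inspect the Prometheus server logs for rule evaluation errors. The monitoring_server service name below is an assumption; verify the exact name in the docker service ls output:

    # List the StackLight LMA Docker services in the monitoring stack
    docker service ls

    # Search the Prometheus server logs for rule evaluation errors
    docker service logs monitoring_server 2>&1 | grep -i 'rule'
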
Tuning Not required
Salesforce notifier

This section describes the alerts for the Salesforce notifier service.


SfNotifierDown
Severity Critical
Summary The sf-notifier service is down for 2 minutes.
Raise condition absent(sf_auth_ok) == 1
Description Raises when the Docker container with the sf-notifier service is down for 2 minutes. In this case, notifications to Salesforce cannot be sent.
Troubleshooting
  • Inspect the Salesforce notifier service logs using docker service logs monitoring_sf_notifier.
  • Since the issue may be connected to Docker or system, inspect the Docker and system alerts.
Tuning Not required
SfNotifierAuthFailure
Severity Critical
Summary The sf-notifier service fails to authenticate to Salesforce for 2 minutes.
Raise condition sf_auth_ok == 0
Description Raises when the sf-notifier service fails to authenticate to Salesforce. In this case, notifications to Salesforce cannot be sent. The alert fires after 2 minutes.
Troubleshooting
  • Verify the authentication credentials.

  • Verify that the instance type is sandbox.

  • Update the cluster model parameters and redeploy the sf-notifier container:

    1. On the cluster level of the Reclass model, update the cluster model parameters as required, paying attention to sandbox and password.

    2. Refresh Salt:

      salt '*' saltutil.refresh_pillar
      
    3. Redeploy the sf-notifier container:

      salt -C 'I@prometheus:server and I@docker:server' state.sls docker
      
Tuning Not required
SfNotifierErrorsWarning

Removed starting from the MCP 2019.2.2 update

Severity Warning
Summary An average of {{ $value }} sf-notifier error requests appear for 2 minutes.
Raise condition increase(sf_error_count_total[2m]) > 0
Description

Raises when the number of Salesforce notifier errors has been increasing over the last 2 minutes, indicating an issue with sending the alert notifications to Salesforce. The issue is typically connected with the limits set in the Salesforce instance. The alert fires after 2 minutes.

Warning

The alert has been removed starting from the 2019.2.2 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting Inspect the sf-notifier service logs in /srv/volumes/local/sf_notifier/logs/sfnotifier.log on the mon node that runs the service container. If the logs contain many entries with Salesforce exceptions, the issue can be connected with Salesforce instance limits or the account used by sf-notifier.
Tuning Disable the alert as described in Manage alerts.
Telegraf

This section describes the alerts for the Telegraf service.


TelegrafGatherErrors

Available starting from the 2019.2.5 maintenance update

Severity Major
Summary Telegraf failed to gather metrics.
Raise condition
  • In 2019.2.9 and prior: rate(internal_agent_gather_errors[10m]) > 0
  • In 2019.2.10 and newer: rate(internal_agent_gather_errors{job!="remote_agent"}[10m]) > 0
Description Raises when Telegraf has gathering errors on a node for the last 10 minutes. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the Telegraf logs by running journalctl -u telegraf on the affected node.
Tuning Not required
TelegrafRemoteGatherErrors

Available starting from the 2019.2.10 maintenance update

Severity Major
Summary Remote Telegraf failed to gather metrics.
Raise condition rate(internal_agent_gather_errors{job="remote_agent"}[10m]) > 0
Description Raises when remote Telegraf has gathering errors for the last 10 minutes.
Troubleshooting Inspect the Telegraf monitoring_remote_agent service logs.
Tuning Not required

Alerts that require tuning

After deploying StackLight LMA, you may need to customize some of the predefined alerts depending on the needs of your MCP deployment. This section provides the list of alerts that require customization after deployment. Some other alerts can also be configured as required. For an entire list of alerts and their tuning capabilities, see Available StackLight LMA alerts.

Component Alert
System
  1. NetdevBudgetRanOutsWarning
  2. PacketsDroppedByCpuMinor
  3. PacketsDroppedByCpuWarning
  4. SystemMemoryFullMajor
  5. SystemMemoryFullWarning
  6. SystemRxPacketsDroppedTooHigh
  7. SystemTxPacketsDroppedTooHigh
  8. SystemTxPacketsErrorTooHigh Available since 2019.2.15
  9. SystemRxPacketsErrorTooHigh Available since 2019.2.15
  10. SystemCpuStealTimeWarning Available since 2019.2.6
  11. SystemCpuStealTimeCritical Available since 2019.2.6
  12. OVSTooManyPortRunningOnAgent Available since 2019.2.6
OpenStack
  1. IronicErrorLogsTooHigh Available since 2019.2.6
  2. RabbitmqFdUsageWarning Available since 2019.2.6
  3. RabbitmqFdUsageCritical Available since 2019.2.6
OpenContrail
  1. ContrailVrouterLLSSessionsChangesTooHigh
  2. ContrailVrouterLLSSessionsTooHigh
  3. ContrailVrouterDNSXMPPSessionsChangesTooHigh
  4. ContrailVrouterDNSXMPPSessionsTooHigh
  5. ContrailVrouterXMPPSessionsChangesTooHigh
  6. ContrailVrouterXMPPSessionsTooHigh
  7. ContrailXMPPSessionsChangesTooHigh
  8. ContrailXMPPSessionsTooHigh
Ceph

Note

Ceph prediction alerts are available starting from the 2019.2.3 maintenance update and must be enabled manually.

  1. CephPredictOsdIOPSthreshold
  2. CephPredictOsdIOPSauto
  3. CephPredictOsdWriteLatency
  4. CephPredictOsdReadLatency
  5. CephPredictPoolSpace
  6. CephPredictPoolIOPSthreshold
  7. CephPredictPoolIOPSauto
InfluxDB Deprecated in Q4`18
  1. InfluxdbSeriesMaxNumberCritical

Generate the list of alerts for a particular deployment

To view the list of available alerts, you can automatically generate the documentation for your particular MCP cluster deployment. Alternatively, you can see the list of alerts using the Prometheus web UI.

To generate the documentation:

  1. Log in to the Salt Master node.

  2. Run the Sphinx Salt state:

    salt -C 'I@sphinx:server' state.sls sphinx
    
  3. Refer to the deployment plan to obtain the public VIP associated with the proxy nodes. For details, see openstack_proxy_address in the Product related parameters subsection of the Create a deployment metadata model using the Model Designer UI section in MCP Deployment Guide.

  4. Paste the obtained VIP with the port 8090 to a web browser to access the generated documentation.

  5. Navigate to the Functional Alarms Definitions section to see the list of alerts for your particular MCP cluster deployment.
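
To quickly verify that the generated documentation is reachable, you can also query the proxy VIP from the command line. Replace <proxy_VIP> with the address obtained in the previous steps and adjust the scheme if the proxy terminates SSL:

    curl -I http://<proxy_VIP>:8090/
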

Add new features to an existing StackLight LMA deployment

This section describes how to install new functionality on an existing StackLight LMA deployment or integrate StackLight LMA with additional services.

For example, the Alerta service installs automatically when you deploy StackLight LMA but you must install Alerta manually if you already have a running StackLight LMA deployment without Alerta.

Install Alerta

Alerta is a tool that receives, consolidates, and deduplicates the alerts sent by Alertmanager and visually represents them through a web UI. Alerta provides an overview of the most recent and watched alerts, and enables you to group or filter the alerts according to your needs.

To install Alerta on an existing StackLight LMA deployment:

  1. Log in to the Salt Master node.

  2. Update the system level of your Reclass model.

  3. In the stacklight/server.yml file, add the following classes:

    - system.mongodb.server.cluster
    - system.prometheus.alerta
    - system.prometheus.alertmanager.notification.alerta
    - system.prometheus.server.alert.alerta_relabel
    
  4. In the stacklight/client.yml file:

    1. Add the system.docker.swarm.stack.monitoring.alerta class.

    2. Add the following parameters:

      _param:
        docker_image_alerta: docker-prod-local.artifactory.mirantis.com/mirantis/external/alerta-web:stable
      
  5. In the stacklight/init.yml file, specify the following parameters:

    alerta_admin_username: admin@alerta.io
    alerta_admin_password: password
    
  6. In the stacklight/proxy.yml file, add the following class:

    - system.nginx.server.proxy.monitoring.alerta
    
  7. Refresh the pillar:

    salt '*' saltutil.refresh_pillar
    
  8. Install MongoDB and the Alerta container:

    salt -C 'I@mongodb:server' state.sls mongodb.server
    sleep 30
    salt -C 'I@mongodb:server' state.sls mongodb.cluster
    salt -C 'I@prometheus:server' state.sls prometheus -b 1
    salt -C 'I@docker:swarm:role:master and I@prometheus:server' state.sls docker
    salt -C 'I@nginx:server' state.sls nginx
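
Once the states complete, you can verify that the Alerta container is running. The monitoring_alerta service name below is an assumption; check the exact name in the docker service ls output:

    # Run on a mon node
    docker service ls | grep -i alerta
    docker service logs monitoring_alerta --tail 50
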
    

Enable Gainsight integration

Caution

Starting from the MCP 2019.2.9 update, the Gainsight integration service is considered deprecated.

Gainsight is a customer relationship management (CRM) tool and extension for Salesforce. The Gainsight integration service queries Prometheus for the metrics data and sends the data to Gainsight. Mirantis uses the collected data for further analysis and reports to improve the quality of customer support. For more information, see MCP Reference Architecture: StackLight LMA components.

Note

Gainsight formats the data using the single quote as the quote character and commas as separators.

To enable Gainsight integration service on an existing StackLight LMA deployment:

  1. Log in to the Salt Master node.

  2. Update the system level of your Reclass model.

  3. Add the classes and parameters to stacklight/client.yml as required:

    1. For OpenStack environments, add the default OpenStack-related metrics:

      - system.prometheus.gainsight.query.openstack
      
    2. Add the main Gainsight class:

      - system.docker.swarm.stack.monitoring.gainsight
      
    3. Specify the following parameters:

      parameters:
        _param:
          docker_image_prometheus_gainsight: docker-prod-local.artifactory.mirantis.com/openstack-docker/gainsight:${_param:mcp_version}
          gainsight_csv_upload_url: <URL_to_Gainsight_API>
          gainsight_account_id: <customer_account_ID_in_Salesforce>
          gainsight_environment_id: <customer_environment_ID_in_Salesforce>
          gainsight_app_org_id: <Mirantis_organization_ID_in_Salesforce>
          gainsight_access_key: <Gainsight_access_key>
          gainsight_job_id: <Gainsight_job_ID>
          gainsight_login: <Gainsight_login>
          gainsight_csv_retention: <retention_in_days>
      

      Note

      To obtain the values for the above parameters, contact Mirantis Customer Success Team through cs-team@mirantis.com.

      The retention period for CSV files is set to 180 days by default.

    4. Optional. Customize the frequency of CSV uploads to Gainsight by specifying the duration parameter in the prometheus block.

      Example:

      parameters:
        prometheus:
          gainsight:
            crontab:
              duration: '0 0 * * *'

      The '0 0 * * *' value uses the standard cron format; in this example, the CSV files are uploaded daily at midnight.
      
  4. Refresh Salt grains and pillars:

    salt '*' state.sls salt.minion.grains && salt '*' mine.update && salt '*' saltutil.refresh_pillar
    
  5. Deploy the configuration files for the Gainsight Docker service:

    salt -C 'I@docker:swarm' state.sls prometheus.gainsight
    
  6. Deploy the new monitoring_gainsight Docker service:

    salt -C 'I@docker:swarm:role:master' state.sls docker.client
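
Once the deployment completes, you can verify that the monitoring_gainsight Docker service is running and inspect its logs:

    # Run on a mon node
    docker service ls | grep -i gainsight
    docker service logs monitoring_gainsight --tail 50
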
    

Enable monitoring of the Open vSwitch processes

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

If you have deployed Neutron Open vSwitch (OVS) as the networking solution for your OpenStack environment, you can enable StackLight LMA to monitor the OVS processes and to issue alerts if the memory consumption of an OVS process exceeds the default thresholds of 20% and 30% of system memory. The procedure below updates the monitoring configuration for the nodes that run OVS, typically cmp and gtw.

To enable monitoring of the OVS processes:

  1. Log in to the Salt Master node.

  2. Verify that OVS is enabled:

    salt -C "I@linux:network:bridge:openvswitch" test.ping
    

    The command output displays the nodes that run openvswitch, for example, cmp and gtw.

  3. Open your Git project repository with the Reclass model on the cluster level.

  4. In openstack/compute/init.yml and openstack/gateway.yml, specify the following parameters:

    parameters:
      telegraf:
        agent:
          input:
            procstat:
              process:
                ovs-vswitchd:
                  exe: ovs-vswitchd
      prometheus:
        server:
          alert:
            ProcessOVSmemoryWarning:
              if: procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.2
              for: 5m
              labels:
                severity: warning
                service: ovs
              annotations:
                summary: "ovs-vswitchd takes more than 20% of system memory"
                description: "ovs-vswitchd takes more than 20% of system memory"
            ProcessOVSmemoryCritical:
              if: procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.3
              for: 5m
              labels:
                severity: critical
                service: ovs
              annotations:
                summary: "ovs-vswitchd takes more than 30% of system memory"
                description: "ovs-vswitchd takes more than 30% of system memory"
    
  5. Apply the changes:

    1. Refresh Salt pillars:

      salt '*' saltutil.refresh_pillar
      
    2. Add the Telegraf configuration:

      salt -C "I@linux:network:bridge:openvswitch" state.sls telegraf.agent
      
    3. Add the Prometheus alerts:

      salt 'mon*' state.sls prometheus.server
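
To spot-check the result, you can, for example, verify that the procstat input has been rendered on the OVS nodes. The command below assumes the default Telegraf configuration directory /etc/telegraf/telegraf.d; adjust the path if your deployment differs:

salt -C "I@linux:network:bridge:openvswitch" cmd.run 'grep -r ovs-vswitchd /etc/telegraf/telegraf.d/'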
      

Enable SSL certificates monitoring

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

If you use SSL certificates in your MCP deployment, you can configure StackLight LMA to monitor such certificates and issue an alert when a certificate is due to expire. By default, the alerts are raised when a certificate expires in less than 60 or 30 days. This allows you to generate a new certificate and replace the existing one in time to prevent a cluster outage caused by an expired certificate.

To enable SSL certificates monitoring:

  1. Log in to the Salt Master node.

  2. Verify that you have updated the salt-formulas-salt package.

  3. Update the Salt mine:

    salt -C 'I@salt:minion' state.sls salt.minion.grains
    salt -C 'I@salt:minion' mine.update
    
  4. Update the Telegraf configuration:

    salt -C 'I@telegraf:agent' state.sls telegraf
    
  5. Update the Prometheus configuration:

    salt -C 'I@prometheus:server and I@docker:swarm' state.sls prometheus.server
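
As an optional spot check, you can look for certificate-related inputs in the rendered Telegraf configuration. The example below assumes the default /etc/telegraf/telegraf.d configuration directory; the exact input names depend on your Telegraf formula version:

salt -C 'I@telegraf:agent' cmd.run 'grep -ril cert /etc/telegraf/telegraf.d/'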
    

Enable SMART disk monitoring

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

If your MCP cluster includes physical disks that support Self-Monitoring, Analysis and Reporting Technology (SMART), you can configure StackLight LMA to monitor such disks by parsing their SMART data and to raise alerts if disk errors occur. By default, all disks on the bare metal servers will be scanned.

To enable SMART disk monitoring:

  1. Log in to the Salt Master node.

  2. Verify that you have updated the salt-formulas-linux package.

  3. Install the smartmontools package as a required dependency:

    salt -C 'I@salt:minion' cmd.run 'apt update; apt install -y smartmontools'
    
  4. (Optional) Modify the default parameters per node or server as required, for example, to exclude a specific device from the checks.

    (...)
    parameters:
      _param:
      (...)
      telegraf:
        agent:
          input:
            smart:
              excludes:
                - /dev/sdd
    
  5. Refresh Salt pillar:

    salt -C 'I@salt:minion' saltutil.refresh_pillar
    
  6. Update the Salt mine:

    salt -C 'I@salt:minion' state.sls salt.minion.grains
    salt -C 'I@salt:minion' mine.update
    
  7. Update the Telegraf configuration:

    salt -C 'I@telegraf:agent' state.sls telegraf
    
  8. Update the Prometheus configuration:

    salt -C 'I@prometheus:server and I@docker:swarm' state.sls prometheus.server
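
Optionally, you can verify that smartmontools is operational on the target servers, for example, by scanning for SMART-capable devices. This is a quick check only; the device list depends on your hardware:

salt -C 'I@salt:minion' cmd.run 'smartctl --scan'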
    

Enable Prometheus Elasticsearch exporter

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Prometheus Elasticsearch exporter queries the Elasticsearch data and exposes it as Prometheus metrics that you can view in the Prometheus web UI. This section describes how to install Prometheus Elasticsearch exporter on an existing MCP cluster. For new clusters starting from the MCP 2019.2.4 maintenance update, Prometheus Elasticsearch exporter is enabled by default.

To enable Prometheus Elasticsearch exporter on an existing MCP cluster:

  1. Open your Git project repository with Reclass model on the cluster level.

  2. In cluster/<cluster_name>/stacklight/client.yml, add the following class:

    classes:
    - system.docker.swarm.stack.monitoring.elasticsearch_exporter
    
  3. In cluster/<cluster_name>/stacklight/server.yml, add the following classes:

    classes:
    - system.prometheus.server.target.dns.elasticsearch_exporter
    - system.prometheus.elasticsearch_exporter
    - system.prometheus.elasticsearch_exporter.queries.compute
    
  4. Log in to the Salt Master node.

  5. Refresh the Salt pillar:

    salt '*' saltutil.refresh_pillar
    
  6. Add the prometheus-es-exporter directory structure and configuration:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.elasticsearch_exporter
    
  7. Deploy Prometheus Elasticsearch exporter:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls docker.client
    
  8. Verify that the container has been deployed:

    salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'docker service ps monitoring_elasticsearch_exporter'
    
  9. Verify that querying of the Elasticsearch cluster does not cause errors:

    salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run 'docker service logs monitoring_elasticsearch_exporter'
    
  10. Add Prometheus Elasticsearch exporter to Prometheus targets:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server
    

Enable TLS for StackLight LMA

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To ensure the confidentiality and integrity of the communication between Prometheus and Telegraf, as well as between Fluentd and Elasticsearch, in your MCP deployment, you can use cryptographic protective measures such as the Transport Layer Security (TLS) protocol. In this case, Prometheus scrapes the data from Telegraf, and Fluentd sends data to Elasticsearch or the Elasticsearch VIP endpoint through encrypted channels. This section describes how to enable TLS v1.2 for an existing StackLight LMA deployment to provide message integrity with SHA384 MAC and RSA TLS certificate signature verification.

Warning

The functionality does not cover encryption of the traffic between HAProxy and Elasticsearch.

To enable TLS for StackLight LMA:

  1. Open your project Git repository with Reclass model on the cluster level.

  2. In infra/config/nodes.yml, add the following classes to the stacklight_log_node01 section:

    stacklight_log_node01:
      classes:
      - system.elasticsearch.client.single
      - system.kibana.client.single
      - system.vnf_onboarding.common.kibana
      - system.elasticsearch.client.ssl
      - system.kibana.client.ssl
    
  3. In infra/init.yml, add the following classes:

    - system.salt.minion.cert.fluentd_prometheus
    - system.fluentd.label.default_metric.prometheus_ssl
    - system.salt.minion.cert.telegraf_agent
    - system.telegraf.agent.output.prometheus_client_ssl
    
  4. In stacklight/init.yml, specify the following parameter:

    fluentd_elasticsearch_scheme: https
    
  5. In stacklight/log.yml:

    1. Replace the system.haproxy.proxy.listen.stacklight.elasticsearch class with the following one:

      system.haproxy.proxy.listen.stacklight.elasticsearch_ssl
      
    2. Add the following classes:

      - system.kibana.server.ssl
      - system.salt.minion.cert.elasticsearch
      
  6. Log in to the Salt Master node.

  7. Apply the following states one by one:

    salt -C "I@salt:master" state.sls reclass
    salt -C "I@salt:minion" state.sls salt.minion.cert
    salt -C "I@salt:minion" state.sls salt.minion.grains
    salt -C "I@salt:minion" mine.update
    salt -C "I@fluentd:agent" state.sls fluentd
    salt -C "I@docker:swarm:role:master and I@prometheus:server" state.sls prometheus.server
    salt -C "I@elasticsearch:server" state.sls haproxy
    salt -C "I@elasticsearch:server"  state.sls elasticsearch.server
    salt -C "I@kibana:server" state.sls kibana.server
    

Enable Ironic monitoring

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

If you have deployed Ironic in your OpenStack environment, you can enable StackLight LMA to monitor Ironic processes and health.

To enable Ironic monitoring:

  1. Log in to the Salt Master node.

  2. Verify that Ironic nodes are up and running:

    salt -C 'I@ironic:api' test.ping
    
  3. Update the Telegraf configuration:

    salt -C 'I@ironic:api' state.sls telegraf
    
  4. Update the td-agent configuration:

    salt -C 'I@ironic:api' state.sls fluentd
    
  5. Customize the IronicErrorLogsTooHigh alert as required.

  6. Update the Prometheus alerts configuration:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b 1
    
  7. Update the Grafana dashboards:

    salt -C 'I@grafana:client' state.sls grafana.client
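
Optionally, you can spot-check that the Ironic-related monitoring configuration has been rendered on the Ironic API nodes. The example below assumes the default Telegraf (/etc/telegraf/telegraf.d/) and td-agent (/etc/td-agent/) configuration directories:

salt -C 'I@ironic:api' cmd.run 'grep -ril ironic /etc/telegraf/telegraf.d/ /etc/td-agent/'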
    

Back up and restore

This section describes how to back up and restore the MCP Control Plane.

Back up and restore the Salt Master node

This section describes how to create backups of the Salt Master node using Backupninja. During the Salt Master node backup, Backupninja creates backup configuration scripts containing the data about PKI and CA certificates located in the /etc/salt/pki and /etc/pki/ca directories as well as the cluster model metadata located in the /srv/salt/reclass directory. Using these backup scripts, you can easily restore your Salt Master node in case of hardware or software failures.

Back up and restore the Salt Master node prior to 2019.2.5

This section describes how to back up and restore the Salt Master node prior to the MCP 2019.2.5 maintenance update.

Enable a backup schedule for the Salt Master node using Backupninja

This section describes how to create a backup schedule for the Salt Master node using the Backupninja utility. By default, Backupninja runs daily at 1:00 AM.

To create a backup schedule for the Salt Master node using Backupninja:

  1. Log in to the Salt Master node.

  2. In cluster/infra/config/init.yml, verify that the following classes and parameters are present:

    classes:
    - system.backupninja.client.single
    - system.openssh.client.root
    parameters:
      _param:
        backupninja_backup_host: <IP>
      salt:
        master:
          backup: true
        minion:
          backup: true
    

    Note

    The backupninja_backup_host parameter is the IP address of the server where Backupninja runs. For example, the kvm03 node. To obtain this IP address or the host name, run salt -C 'I@backupninja:server' grains.item fqdn_ip4 or salt -C 'I@backupninja:server' grains.item fqdn respectively.

  3. In cluster/infra/backup/server.yml, verify that the following class is present:

    classes:
    - system.backupninja.server.single
    
  4. Optionally, override the default configuration of the backup schedule as described in Configure a backup schedule for the Salt Master node.

  5. Apply the salt.minion state:

    salt -C 'I@backupninja:server or I@backupninja:client' state.sls salt.minion
    
  6. Refresh grains and mine for the backupninja client node:

    salt -C 'I@backupninja:client' state.sls salt.minion.grains
    salt -C 'I@backupninja:client' mine.flush
    salt -C 'I@backupninja:client' mine.update
    
  7. Apply the backupninja state to the backupninja client node:

    salt -C 'I@backupninja:client' state.sls backupninja
    
  8. Refresh grains for the backupninja server node:

    salt -C 'I@backupninja:server' state.sls salt.minion.grains
    
  9. Apply the backupninja state to the backupninja server node:

    salt -C 'I@backupninja:server' state.sls backupninja
    

Once you perform the procedure, two backup configuration scripts should be created. By default, the scripts are stored in the /etc/backup.d/ directory and will run daily at 1:00 AM.
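
For example, to list the generated backup configuration scripts on the backupninja client node:

salt -C 'I@backupninja:client' cmd.run 'ls -l /etc/backup.d/'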

Configure a backup schedule for the Salt Master node

By default, Backupninja runs daily at 1:00 AM. This section describes how to override the default configuration of the backup schedule for the Salt Master node. To enable the backup schedule for the Salt Master node in your deployment, refer to Enable a backup schedule for the Salt Master node using Backupninja.

To configure a backup schedule for the Salt Master node:

  1. Log in to the Salt Master node.

  2. Edit the cluster/infra/config/init.yml file as required. Select from the following options:

    • Set the exact backup time for the backup server role to override the default backup time. The backup_times parameters include:

      • day_of_week

        The day of a week to perform backups. Specify 0 for Sunday, 1 for Monday, and so on. If not set, defaults to *.

      • day_of_month

        The day of a month to perform backups. For example, 20, 25, and so on. If not set, defaults to *.

        Note

        Only day_of_week or day_of_month can be active at the same time. If both are defined, day_of_week is prioritized.

      • hour

        The hour to perform backups. Uses the 24-hour format. If not defined, defaults to 1.

      • minute

        The minute to perform backups. For example, 5, 10, 59, and so on. If not defined, defaults to 00.

      Configuration example:

      parameters:
        backupninja:
          enabled: true
          client:
            backup_times:
              day_of_week: 1
              hour: 15
              minute: 45
      

      Note

      These settings change the global Backupninja schedule. Unless individual steps are configured differently, Backupninja runs all steps in the correct order at the scheduled time. This is the recommended way of defining the exact backup order.

    • Disable automatic backups:

      parameters:
        backupninja:
          enabled: true
          client:
            auto_backup_disabled: true
      
    • Re-enable automatic backups by setting the auto_backup_disabled parameter to false or delete the related line in cluster/infra/config/init.yml:

      parameters:
        backupninja:
          enabled: true
          client:
            auto_backup_disabled: false
      
  3. Apply changes by performing the steps 5-10 of the Enable a backup schedule for the Salt Master node using Backupninja procedure.

Create an instant backup of the Salt Master node

After you create a backup schedule as described in Enable a backup schedule for the Salt Master node using Backupninja, you may also need to create an instant backup of your Salt Master node.

To create an instant backup of the Salt Master node using Backupninja:

  1. Verify that you have completed the steps described in Enable a backup schedule for the Salt Master node using Backupninja.

  2. Log in to the Salt Master node.

  3. Move local files to the backupninja server using rsync. For example:

    backupninja -n --run /etc/backup.d/200.backup.rsync
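
Optionally, verify that the backup has arrived on the backupninja server node. The path below is only an example of a typical backup location; adjust it to your configuration:

salt -C 'I@backupninja:server' cmd.run 'ls -l /srv/volumes/backup/backupninja'
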
    
Restore the Salt Master node

You may need to restore the Salt Master node after a hardware or software failure.

To restore the Salt Master node from a Backupninja rsync backup:

  1. Redeploy the Salt Master node using the day01 image with the configuration ISO drive for the Salt Master VM as described in Deploy the Salt Master node.

    Caution

    Make sure to securely back up the configuration ISO drive image. This image contains critical information required to re-install your cfg01 node in case of storage failure, including the master key for all encrypted secrets in the cluster metadata model.

    Failure to back up the configuration ISO image may result in the loss of the ability to manage MCP in certain hardware failure scenarios.

  2. Log in to the Salt Master node.

  3. Configure your deployment model by including the following pillar in cluster/infra/config/init.yml:

    parameters:
      salt:
        master:
          initial_data:
            engine: backupninja
            source: kvm03       # the backupninja server that stores Salt Master backups
            host: cfg01.<domain_name>  # for example: cfg01.deploy-name.local
        minion:
          initial_data:
            engine: backupninja
            source: kvm03       # the backupninja server that stores Salt Master backups
            host: cfg01.<domain_name>  # for example: cfg01.deploy-name.local
    
  4. Verify that the pillar for Backupninja is present:

    salt-call pillar.data backupninja
    

    If the pillar is not present, configure it as described in Enable a backup schedule for the Salt Master node using Backupninja.

  5. Verify that the pillar for master and minion is present:

    salt-call pillar.data salt:minion:initial_data
    salt-call pillar.data salt:master:initial_data
    

    If the pillar is not present, verify the pillar configuration in cluster/infra/config/init.yml described above.

  6. Apply the salt.master.restore and salt.minion.restore states.

    Mirantis recommends running the following command using Linux GNU Screen or alternatives.

    salt-call state.sls salt.master.restore,salt.minion.restore
    

    Running the states above restores the Salt Master node's PKI and CA certificates and creates flag files in the /srv/salt/ directory that indicate the Salt Master node restore is completed.

    Caution

    If you rerun the state, it will not restore the Salt Master node again. To repeat the restore procedure, first delete the master-restored and minion-restored files from the /srv/salt directory and rerun the above states.

  7. Verify that the Salt Master node is restored:

    salt-key
    salt -t2 '*' saltutil.refresh_pillar
    ls -la /etc/pki/ca/salt_master_ca/
    

Back up and restore the Salt Master node starting from 2019.2.5

This section describes how to back up and restore the Salt Master node starting from the MCP 2019.2.5 maintenance update.

Enable a backup schedule for the Salt Master node using Backupninja

This section describes how to create a backup schedule for the Salt Master node using the Backupninja utility. By default, Backupninja runs daily at 1:00 AM.

To create a backup schedule for the Salt Master node using Backupninja:

  1. Log in to the Salt Master node.

  2. In cluster/infra/config/init.yml, verify that the following classes and parameters are present:

    classes:
    - system.backupninja.client.single
    - system.openssh.client.root
    parameters:
      _param:
        backupninja_backup_host: <IP>
      salt:
        master:
          backup: true
        minion:
          backup: true
    

    Note

    The backupninja_backup_host parameter is the IP address or the host name of a node running the backupninja server. To obtain this IP address or host name, run salt -C 'I@backupninja:server' grains.item fqdn_ip4 or salt -C 'I@backupninja:server' grains.item fqdn respectively.

  3. In cluster/infra/backup/server.yml, verify that the following class is present:

    classes:
    - system.backupninja.server.single
    
  4. Optionally, override the default configuration of the backup schedule as described in Configure a backup schedule for the Salt Master node.

  5. Apply the salt.minion state:

    salt -C 'I@backupninja:server or I@backupninja:client' state.sls salt.minion
    
  6. Refresh grains and mine for the backupninja client node:

    salt -C 'I@backupninja:client' state.sls salt.minion.grains
    salt -C 'I@backupninja:client' mine.update
    
  7. Apply the backupninja state to the backupninja client node:

    salt -C 'I@backupninja:client' state.sls backupninja
    
  8. Refresh grains on the backupninja server node:

    salt -C 'I@backupninja:server' state.sls salt.minion.grains
    
  9. Apply the backupninja state to the backupninja server node:

    salt -C 'I@backupninja:server' state.sls backupninja
    

Once you perform the procedure, two backup configuration scripts should be created.

Configure a backup schedule for the Salt Master node

Warning

This configuration presupposes manual backups or backups performed by a cron job. If you use the Backupninja backup pipeline job, see Configure the Backupninja backup pipeline.

By default, Backupninja runs daily at 1:00 AM. This section describes how to override the default configuration of the backup schedule for the Salt Master node. To enable the backup schedule for the Salt Master node in your deployment, refer to Enable a backup schedule for the Salt Master node using Backupninja.

To configure a backup schedule for the Salt Master node:

  1. Log in to the Salt Master node.

  2. Edit the cluster/infra/config/init.yml file as required. Select from the following options:

    • Set the exact backup time for the backup server role to override the default backup time. The backup_times parameters include:

      • day_of_week

        The day of a week to perform backups. Specify 0 for Sunday, 1 for Monday, and so on. If not set, defaults to *.

      • day_of_month

        The day of a month to perform backups. For example, 20, 25, and so on. If not set, defaults to *.

        Note

        Only day_of_week or day_of_month can be active at the same time. If both are defined, day_of_week is prioritized.

      • hour

        The hour to perform backups. Uses the 24-hour format. If not defined, defaults to 1.

      • minute

        The minute to perform backups. For example, 5, 10, 59, and so on. If not defined, defaults to 00.

      Configuration example:

      parameters:
        backupninja:
          enabled: true
          client:
            backup_times:
              day_of_week: 1
              hour: 15
              minute: 45
      

      Note

      These settings change the global Backupninja schedule. Unless individual steps are configured differently, Backupninja runs all steps in the correct order at the scheduled time. This is the recommended way of defining the exact backup order.

    • Disable automatic backups:

      parameters:
        backupninja:
          enabled: true
          client:
            auto_backup_disabled: true
      
    • Re-enable automatic backups by setting the auto_backup_disabled parameter to false or delete the related line in cluster/infra/config/init.yml:

      parameters:
        backupninja:
          enabled: true
          client:
            auto_backup_disabled: false
      
  3. Apply changes by performing the steps 5-10 of the Enable a backup schedule for the Salt Master node using Backupninja procedure.

Create an instant backup of the Salt Master node

This section instructs you on how to create an instant backup of the Salt Master node using Backupninja.

To create an instant backup of the Salt Master node using Backupninja:

  1. Verify that you have completed the steps described in Enable a backup schedule for the Salt Master node using Backupninja.

  2. Select from the following options:

    • Create an instant backup of the Salt Master node automatically as described in Create an instant backup using Backupninja pipeline.

    • Create an instant backup of the Salt Master node manually:

      1. Log in to the Salt Master node.

      2. Create a backup by moving local files to the backupninja server using the backupninja script which uses rsync. For example:

        backupninja -n --run /etc/backup.d/200.backup.rsync
        
Restore the Salt Master node

You may need to restore the Salt Master node after a hardware or software failure. This section instructs you on how to restore the Salt Master node using Backupninja.

To restore the Salt Master node using Backupninja:

Select from the following options:

  • Restore the Salt Master node automatically as described in Restore the services using Backupninja pipeline.

  • Restore the Salt Master node manually:

    1. Redeploy the Salt Master node using the day01 image with the configuration ISO drive for the Salt Master VM as described in Deploy the Salt Master node.

      Caution

      Make sure to securely back up the configuration ISO drive image. This image contains critical information required to re-install your cfg01 node in case of storage failure, including the master key for all encrypted secrets in the cluster metadata model.

      Failure to back up the configuration ISO image may result in the loss of the ability to manage MCP in certain hardware failure scenarios.

    2. Log in to the Salt Master node.

    3. On the cluster level of the Reclass model, add the following pillar in the cluster/infra/config/init.yml file:

      parameters:
        salt:
          master:
            initial_data:
              engine: backupninja
              source: ${_param:backupninja_backup_host}  # the backupninja server that stores Salt Master backups, for example: kvm03
              host: ${_param:infra_config_hostname}.${_param:cluster_domain}  # for example: cfg01.deploy-name.local
              home_dir: '/path/to/backups/' # for example: '/srv/volumes/backup/backupninja'
          minion:
            initial_data:
              engine: backupninja
              source: ${_param:backupninja_backup_host}  # the backupninja server that stores Salt Master backups, for example: kvm03
              host: ${_param:infra_config_hostname}.${_param:cluster_domain}  # for example: cfg01.deploy-name.local
              home_dir: '/path/to/backups/' # for example: '/srv/volumes/backup/backupninja'
      
    4. Verify that the pillar for Backupninja is present:

      salt-call pillar.data backupninja
      

      If the pillar is not present, configure it as described in Enable a backup schedule for the Salt Master node using Backupninja.

    5. Verify that the pillar for master and minion is present:

      salt-call pillar.data salt:minion:initial_data
      salt-call pillar.data salt:master:initial_data
      

      If the pillar is not present, verify the pillar configuration in cluster/infra/config/init.yml described above.

    6. Apply the salt.master.restore and salt.minion.restore states.

      Mirantis recommends running the following command using Linux GNU Screen or alternatives.

      salt-call state.sls salt.master.restore,salt.minion.restore
      

      Running the states above restores the Salt Master node's PKI and CA certificates and creates flag files in the /srv/salt/ directory that indicate the Salt Master node restore is completed.

      Caution

      If you rerun the state, it will not restore the Salt Master node again. To repeat the restore procedure, first delete the master-restored and minion-restored files from the /srv/salt directory and rerun the above states.

    7. Verify that the Salt Master node is restored:

      salt-key
      salt -t2 '*' saltutil.refresh_pillar
      ls -la /etc/pki/ca/salt_master_ca/
      

Back up and restore an OpenStack Control Plane

This section describes how to back up and restore an OpenStack Control Plane and Cinder volumes. For a high availability (HA) OpenStack environment, you need to manually back up only Glance images, Cinder volumes, and databases.

Back up and restore a MySQL database

The Mirantis Cloud Platform (MCP) uses MySQL databases to store the data generated by different components of MCP. Mirantis recommends backing up your MySQL databases daily to ensure the integrity of your data. You should create an instant backup before upgrading your MySQL database. You may also need to create a MySQL database backup for testing purposes.

MCP uses the Xtrabackup utility to back up MySQL databases. Xtrabackup installs automatically with your cloud environment using a SaltStack formula and includes the following components:

  • The Xtrabackup server stores backups rsynced from the Xtrabackup client nodes and runs on any node, for example, the Salt Master node.
  • The Xtrabackup client sends database backups to the Xtrabackup server and runs on the MySQL Galera Database Master node.

This section describes how to create and restore your MySQL databases using Xtrabackup.

Enable a backup schedule for a MySQL database

To ensure the consistent and timely backing up of your data, create a backup schedule using Xtrabackup.

To create a backup schedule for a MySQL database:

  1. Log in to the Salt Master node.

  2. Verify the configuration of the backup server nodes:

    salt -C 'I@xtrabackup:server' test.ping
    

    If the output of the command above is not empty, move to the next step. Otherwise, configure the xtrabackup server role by adding the following lines in cluster/infra/config/init.yml:

    Note

    By default, Xtrabackup keeps three complete backups and their incrementals on the xtrabackup client node.

    classes:
    - system.xtrabackup.server.single
    parameters:
      _param:
         xtrabackup_public_key: <generate_your_keypair>
    
  3. Sync the pillar data:

    salt '*' saltutil.sync_all
    
  4. Verify the configuration of the backup client nodes:

    salt -C 'I@xtrabackup:client' test.ping
    

    If the output is not empty, move to the next step. Otherwise, configure the xtrabackup client role by adding the following lines in cluster/openstack/database/init.yml:

    classes:
    - system.xtrabackup.client.single
    parameters:
      _param:
        xtrabackup_remote_server: cfg01
        root_private_key: |
          <generate_your_keypair>
    
  5. Verify that the xtrabackup_remote_server parameter is defined correctly:

    salt -C 'I@xtrabackup:client' pillar.get _param:xtrabackup_remote_server
    

    If the system response does not contain an IP address, add the following parameter to cluster/openstack/database/init.yml:

    parameters:
      _param:
        xtrabackup_remote_server: <host_name>
    

    Substitute <host_name> with the resolvable host name of the host on which the Xtrabackup server is running. For example, kvm03, which is the default value in MCP.

  6. Optionally, override the default Xtrabackup configuration as described in Configure a backup schedule for a MySQL database.

  7. Run the following command on the Salt Master node:

    salt '*' saltutil.refresh_pillar
    
  8. Apply the salt.minion state:

    salt -C 'I@xtrabackup:client or I@xtrabackup:server' state.sls salt.minion
    
  9. Refresh grains for the xtrabackup client node:

    salt -C 'I@xtrabackup:client' saltutil.sync_grains
    
  10. Update the mine for the xtrabackup client node:

    salt -C 'I@xtrabackup:client' mine.flush
    salt -C 'I@xtrabackup:client' mine.update
    
  11. Apply the xtrabackup client state:

    salt -C 'I@xtrabackup:client' state.sls openssh.client,xtrabackup
    
  12. Apply the linux.system.cron state:

    salt -C 'I@xtrabackup:server' state.sls linux.system.cron
    
  13. Apply the xtrabackup server state:

    salt -C 'I@xtrabackup:server' state.sls xtrabackup
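
To verify that the backup cron record has been created on the xtrabackup client node, you can, for example, inspect the crontab. The exact record may differ depending on your schedule configuration:

salt -C 'I@xtrabackup:client' cmd.run 'crontab -l | grep innobackupex'
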
    
Configure a backup schedule for a MySQL database

This section instructs you on how to configure a backup schedule for a MySQL database using the Xtrabackup service.

MCP provides the following options for the MySQL backup schedule configuration:

  • Recommended. Back up the MySQL database using the Jenkins pipeline job to run the Xtrabackup script on a predefined schedule.
  • Back up the MySQL database by running the Xtrabackup script using the predefined crontab record.

Depending on your preferences, proceed with one of the following sections:

Configure a backup schedule in the Jenkins pipeline

This section describes how to configure the MySQL backup schedule in the Galera database backup Jenkins pipeline.

To configure the backup schedule for a MySQL database:

  1. Verify that you have enabled the schedule as described in Enable a backup schedule for a MySQL database.

  2. Verify that cron for the innobackupex-runner script is disabled:

    1. Log in to the Salt Master node.

    2. Verify that the cron parameter is defined in the pillar data and is set to false:

      salt -C 'I@xtrabackup:client' pillar.data xtrabackup:client:cron
      salt -C 'I@xtrabackup:server' pillar.data xtrabackup:server:cron
      

      If the parameter is not defined or set to true, manually change the value to false in the infra/backup/client_mysql.yml on the cluster Reclass level.

    3. If the schedule was configured before, remove the old schedule configuration by removing the backup_times pillar block from the cluster/infra/config/init.yml and cluster/openstack/database/init.yml files if it is present in parameters:xtrabackup:client.

    4. Apply the xtrabackup state:

      salt -C 'I@xtrabackup:client or I@xtrabackup:server' state.sls xtrabackup
      
    5. On the dbs01 node, verify that crontab is working and the Xtrabackup job is disabled:

      crontab -l
      # Lines below here are managed by Salt, do not edit
      # SALT_CRON_IDENTIFIER:/usr/local/bin/innobackupex-runner.sh
      # DISABLED 0 */12 * * * /usr/local/bin/innobackupex-runner.sh
      
    6. Verify the system response of the following command:

      salt -C I@xtrabackup:client pillar.get _param
      

      The system response should include the following parameters for the Jenkins trigger but the values may vary:

      parameters:
        _param:
          backup_min: "0"
          backup_hour: "*/12"
          backup_day_of_month: "*"
          backup_month: "*"
          backup_day_of_week: "*"
      

      If the above parameters are not defined, set them up in infra/init.yml on the cluster Reclass level.

  3. Configure the Galera database backup pipeline as required:

    1. Open the cluster Reclass level of your deployment.

    2. Move the include of the infra.backup.client_mysql level from openstack/database/master.yml to openstack/database/init.yml.

    3. Refresh the grains to obtain the correct IP addresses for /.ssh/authorized_keys:

      salt -C I@xtrabackup:client state.sls salt
      salt -C I@xtrabackup:client mine.update
      salt -C I@xtrabackup:client saltutil.sync_all
      
    4. Apply the xtrabackup state to configure access for new nodes:

      salt -C I@xtrabackup:server state.sls xtrabackup
      
    5. In cicd/control/leader.yml, add the following class:

      classes:
      - system.jenkins.client.job.deploy.galera_database_backup
      
    6. Re-apply the jenkins state on the cid01 node:

      salt -C I@jenkins:client state.sls jenkins
      
    7. Optional. To define a custom backup time, override the backup parameters in the infra/init.yml file:

      parameters:
        _param:
          backup_min: "0"
          backup_hour: "*/12"
          backup_day_of_month: "*"
          backup_month: "*"
          backup_day_of_week: "*"
      
      Backup parameters details
      Parameter name Description Possible values
      backup_min Value in minutes when the backup should run 0-59, *
      backup_hour Value in hours when the backup should run 0-23, *
      backup_day_of_month Value in days of month when the backup should run 1-31, *
      backup_month Value in months when the backup should run 1-12, *
      backup_day_of_week Value in days of week when the backup should run 0-6, *

      All parameters can also use / to define step values. For example, the */12 value for the backup_hour parameter means that the backup starts every twelfth hour. See Jenkins cron syntax in the official Jenkins documentation for details.

Configure a backup schedule using the Xtrabackup script directly

This section describes how to configure the MySQL backup schedule using the Xtrabackup script based on the crontab record.

Note

We recommend that you Configure a backup schedule in the Jenkins pipeline instead of using the Xtrabackup script directly.

By default, Xtrabackup stores three complete backups with their incremental backups. This section describes how to override the default configuration of the backup schedule for a MySQL database. To enable the backup schedule in your deployment, refer to Enable a backup schedule for a MySQL database.

To configure a backup schedule for a MySQL database:

  1. Log in to the Salt Master node.

  2. Select from the following options:

    • Override the default Xtrabackup configuration by setting a custom time interval in cluster/infra/config/init.yml.

      1. Set the following parameters as required:

        hours_before_full

        Sets the full backup frequency. If set to 48, the full backup is performed every two days. Within 48 hours, only incremental backups are performed.

        full_backups_to_keep

        Sets the number of full backups to keep.

        incr_before_full

        Sets the number of incremental backups to be performed between full backups. If set to 3, three incremental backups will be performed between full backups.

        cron

        If set to false, disables the automatic backup by removing the cron job that triggers an automatic backup. Set cron: true to enable the automatic backup.

        Configuration example:

        parameters:
          xtrabackup:
            server:
              enabled: true
              hours_before_full: 48
              full_backups_to_keep: 5
              incr_before_full: 3
              cron: false
        
      2. Verify that the hours_before_full parameter of the xtrabackup client in cluster/openstack/database/init.yml matches the same parameter of the xtrabackup server in cluster/infra/config/init.yml.

    • Set the exact backup time for the Xtrabackup server role in cluster/infra/config/init.yml and the Xtrabackup client role in cluster/openstack/database/init.yml by configuring the backup_times section.

      The backup_times parameters include:

      day_of_week

      The day of a week to perform backups. Specify 0 for Sunday, 1 for Monday, and so on. If not set, defaults to *.

      day_of_month

      The day of a month to perform backups. For example, 20, 25, and so on. If not set, defaults to *.

      Note

      Only day_of_week or day_of_month can be active at the same time. If both are defined, day_of_week is prioritized.

      month

      The month to perform backups. Available values include 1 for January, 2 for February, and so on up to 12 for December.

      hour

      The hour to perform backups. Uses the 24-hour format. If not defined, defaults to 1.

      minute

      The minute to perform backups. For example, 5, 10, 59, and so on. If not defined, defaults to 00.

      Note

      If any of the individual backup_times parameters is not defined, the default * value will be used. For example, if the minute parameter is *, the backup will run every minute, which is usually not desired.

      Caution

      Only the backup_times section or the hours_before_full and incr_before_full parameters can be active at the same time. If both are defined, the backup_times section is prioritized.

      Configuration example for the Xtrabackup server role:

      parameters:
        xtrabackup:
          server:
            enabled: true
            full_backups_to_keep: 3
            incr_before_full: 3
            backup_dir: /srv/backup
            backup_times:
              day_of_week: 0
              hour: 4
              minute: 52
            key:
              xtrabackup_pub_key:
                enabled: true
                key: key
      

      Configuration example for the Xtrabackup client role:

      parameters:
        xtrabackup:
          client:
            enabled: true
            full_backups_to_keep: 3
            incr_before_full: 3
            backup_times:
              day_of_week: 0
              hour: 4
              minute: 52
            compression: true
            compression_threads: 2
            database:
              user: username
              password: password
            target:
              host: cfg01
            cron: true
      

      The cron parameter, if set to false, disables the script that automatically triggers the backup scripts. For the correct backup strategy, set cron to true or use the backup pipeline as described in Configure a backup schedule in the Jenkins pipeline.

  3. Apply changes by performing the steps 5-10 of the Enable a backup schedule for a MySQL database procedure.

Create an instant backup of a MySQL database

This section instructs you on how to create an instant backup of a MySQL database using the Xtrabackup service.

To create an instant backup for a MySQL database using Xtrabackup:

  1. Verify that you have completed the steps described in Enable a backup schedule for a MySQL database and optionally in Configure a backup schedule for a MySQL database.
  2. Create an instant backup either using Jenkins or manually.
Create an instant backup of a MySQL database automatically

After you create a backup schedule as described in Enable a backup schedule for a MySQL database, you may also need to create an instant backup of a MySQL database.

To create an instant backup automatically:

  1. Log in to the Jenkins web UI.

  2. Open the Galera database backup pipeline.

    1. Specify the required parameters:

      Galera database backup parameters
      Parameter Description and values
      SALT_MASTER_URL Define the IP address of your Salt Master node host and the salt-api port. For example, http://172.18.170.27:6969.
      SALT_MASTER_CREDENTIALS Define credentials_id as credentials for the connection.
      OVERRIDE_BACKUP_NODE Fill in the name of the node to back up if you want to override the automatic node selection. The default value is none, which triggers the automatic node selection workflow.
    2. Click Deploy.

      The pipeline workflow:

      1. Primary component location and node selection. The pipeline locates the current primary component and selects one of its nodes to use as a source of data for the backup.

        Note

        The pipeline skips this stage if the OVERRIDE_BACKUP_NODE parameter is defined with a preferred node name.

      2. Backup preparation. Several necessary steps are run to verify that the openssh and xtrabackup states are up to date.

      3. Backup.

      4. Cleanup. The cleanup script is triggered to clean the temporary directories and old backups.

  3. Verify that the backup has been created and, in case of the remote backup storage, moved correctly.

Create an instant backup of a MySQL database manually

After you create a backup schedule as described in Enable a backup schedule for a MySQL database, you may also need to create an instant backup of a MySQL database.

To create an instant backup manually:

  1. Log in to the MySQL database master node. For example, dbs01.

  2. Run the following script:

    /usr/local/bin/innobackupex-runner.sh
    
  3. Verify that a complete backup has been created on the MySQL database master node:

    ls /var/backups/mysql/xtrabackup/full
    

    If you rerun /usr/local/bin/innobackupex-runner.sh, it creates an incremental backup for the previous complete backup in /var/backups/mysql/xtrabackup/incr.

    You can pass the following flags with the innobackupex-runner.sh script:

    • -s makes the script skip the cleanup. This can be useful if you want to trigger a manual backup while keeping all the previous backups. Be aware that once you run the script without the -s flag or an automatic backup is triggered, the backups will be cleaned and only the defined number of the latest backups will be kept.
    • -f forces the script to run the full backup instead of an incremental one.
    • -c forces the script to run only the cleanup.
  4. From the Salt Master node, verify that the complete backup has been rsynced to the xtrabackup server node:

    salt -C 'I@xtrabackup:server' cmd.run 'ls /var/backups/mysql/xtrabackup/full'
    

    Note

    If you run /usr/local/bin/innobackupex-runner.sh more than once, at least one incremental backup is created in /var/backups/mysql/xtrabackup/incr on the node.

Restore a MySQL database

You may need to restore a MySQL database after a hardware or software failure or if you want to create a clone of the existing database from a backup.

To restore a MySQL database using Xtrabackup:

  1. Log in to the Salt Master node.

  2. In the cluster/infra/backup/client_mysql.yml file, add the following configuration for the Xtrabackup client. If the file does not exist, edit cluster/openstack/database/init.yml.

    parameters:
      xtrabackup:
        client:
          enabled: true
          restore_full_latest: 1
          restore_from: remote
    

    where:

    • restore_full_latest can have the following values: 1 or 2. 1 means restoring the database from the last complete backup and its increments. 2 means restoring the second latest complete backup and its increments.
    • restore_from can have the following values: local or remote. The remote value uses scp to get the files from the xtrabackup server.
  3. Proceed with either automatic restore steps using the Jenkins web UI pipeline or with manual restore steps:

    • Automatic restore steps:

      1. Add the upgrade pipeline to DriveTrain:

        1. Verify that the following lines are present in cluster/cicd/control/leader.yml:

          classes:
          - system.jenkins.client.job.deploy.galera_verify_restore
          
        2. Run the salt -C 'I@jenkins:client' state.sls jenkins.client state.

      2. Log in to the Jenkins web UI.

      3. Open the Verify and Restore Galera pipeline.

      4. Specify the following parameters:

        Parameter Description and values
        RESTORE_TYPE Set job execution type. Select ONLY_RESTORE.
        SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
        SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
      5. Click Deploy.

      6. Open the Console Output of the build from the navigation menu to control the deployment progress and manually confirm actions when prompted during the job execution.

      7. Once the Verify and Restore Galera pipeline finishes successfully, revert the changes made in cluster/openstack/database/init.yml in the step 2.

    • Manual restore steps:

      1. Stop the mysql service on the MySQL Galera Database dbs02 and dbs03 nodes:

        salt -C 'I@galera:slave' service.stop mysql
        
      2. Remove the MySQL log files from the MySQL Galera Database dbs02 and dbs03 nodes:

        salt -C 'I@galera:slave' cmd.run 'rm /var/lib/mysql/ib_logfile*'
        
      3. Stop the mysql service on the MySQL Galera Database Master node:

        salt -C 'I@galera:master' service.stop mysql
        
      4. Log in to the MySQL Galera Database Master node.

      5. Replace the wsrep_cluster_address row in /etc/mysql/my.cnf with the following:

        wsrep_cluster_address="gcomm://"
        
      6. Log in to the Salt Master node.

      7. Move the MySQL database files to a new location /root/mysql/mysql.bak/ on the MySQL Galera Database Master node:

        salt -C 'I@galera:master' cmd.run 'mkdir -p /root/mysql/mysql.bak/'
        salt -C 'I@galera:master' cmd.run 'mv /var/lib/mysql/* /root/mysql/mysql.bak'
        salt -C 'I@galera:master' cmd.run 'rm /etc/salt/.galera_bootstrap'
        
      8. Verify that the MySQL database files are removed from /var/lib/mysql/ on the MySQL Galera Database Master node:

        salt -C 'I@galera:master' cmd.run 'ls /var/lib/mysql/'
        salt -C 'I@galera:master' cmd.run 'ls -ld /var/lib/mysql/.?*'
        
      9. Log in to the MySQL Galera Database Master node where the restore operation occurs.

      10. Run the following state, which restores the databases and creates the /var/backups/mysql/xtrabackup/dbrestored file.

        Mirantis recommends running the following command using Linux GNU Screen or alternatives.

        salt-call state.sls xtrabackup.client.restore
        

        If you rerun the state, it will not restore the database again. To repeat the restore procedure, first delete the /var/backups/mysql/xtrabackup/dbrestored file and then rerun the above xtrabackup state again.

      11. Log in to the Salt Master node.

      12. Verify that the MySQL database files are present again on the MySQL Galera Database Master node:

        salt -C 'I@galera:master' cmd.run 'ls /var/lib/mysql/'
        
      13. Start the mysql service on the MySQL Galera Database Master node:

        salt -C 'I@galera:master' service.start mysql
        

        Note

        This process takes a certain amount of time and does not provide an immediate output.

      14. Start the mysql service on the MySQL Galera Database dbs02 and dbs03 nodes from the Salt Master node:

        salt -C 'I@galera:slave' service.start mysql
        

        Note

        This process takes a certain amount of time and does not provide an immediate output.

      15. Verify that all MySQL Galera Database nodes joined the Galera cluster:

        salt -C 'I@galera:master' mysql.status | grep -A1 wsrep_cluster_size
        
      16. Revert the changes made in cluster/openstack/database/init.yml in the step 2 and in /etc/mysql/my.cnf in the step 5.

Back up and restore Glance images

This section instructs you on how to back up and restore Glance images and covers both the local storage and Ceph back end configurations for Glance. Both procedures presuppose that the Glance API is in a working state.

To back up Glance images:

  1. Log in to the Salt Master node.

  2. Verify that there is enough space on the OpenStack controller node to which you will copy the Glance images.

  3. Copy the images to the backup destination preserving the context. For example, to copy the images to the ctl01 node:

    salt -C 'I@glance:server and *01*' cmd.run '. /root/keystonercv3; cd <PATH_TO_BACKUP>; for i in `openstack image list -c ID -f value`; \
    do openstack image save --file $i $i; done'
    
  4. If needed, move the backed up images to the external backup location.
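
For example, to copy the backed up images to an external host over SSH. The host name and destination directory below are placeholders; substitute them with your external backup location:

rsync -av <PATH_TO_BACKUP>/ <backup_host>:<external_backup_path>/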

To restore Glance images:

  1. Log in to the Salt Master node.

  2. If needed, copy the images from the external backup location to an OpenStack controller node.

  3. Copy the images from the OpenStack controller node to the Glance folder with the images. In the example commands below, the ctl01 node is used as the OpenStack controller node containing the backed up Glance images:

    Note

    GlusterFS will automatically replicate the restored images to all other OpenStack controller nodes.

    • If the local storage is used, run:

      salt -C 'I@glance:server and *01*' cmd.run "cp -a <PATH_TO_BACKUP>/. /var/lib/glance/images/"
      
    • If the Ceph back end is used, run:

      salt -C 'I@glance:server and *01*' cmd.run ". /root/keystonercv3; cd <PATH_TO_BACKUP>; for i in `openstack image list -c ID -f value`; \
      do rbd import -p images $i; rbd snap create -p images $i@snap; done"
      
  4. Verify that the restored Glance images are available:

    salt -C 'I@keystone:client' cmd.run ". /root/keystonercv3; openstack image list"
    

    Note

    If the context of the Glance image files is lost, run:

    salt -C 'I@glance:server' cmd.run "chown glance:glance <IMAGE_FILE_NAME>"
    salt -C 'I@glance:server' cmd.run "chmod 640 <IMAGE_FILE_NAME>"
    

Back up and restore Cinder volumes and snapshots

This section describes how to back up and restore Cinder volumes and snapshots in an OpenStack cluster with Ceph. Starting from the MCP 2019.2.4 maintenance update, the Ceph backup engine parameters are enabled by default. For MCP versions earlier than 2019.2.4, enable these parameters manually.

Manually enable the Ceph backup engine for Cinder

If your MCP version is earlier than 2019.2.4, before you back up or restore the Cinder volumes and snapshots, manually enable the Ceph backup engine parameters as described below. If your MCP version is 2019.2.4 or later, these parameters are enabled by default and you can proceed to Create a backup or restore Cinder volumes and snapshots right away.

To manually enable the Ceph backup engine for Cinder:

  1. Log in to the Salt Master node.

  2. Open your Git project repository with the Reclass model on the classes/cluster/<cluster_name>/ level.

  3. In /ceph/setup.yml, add a new pool called backups. For the pg_num and pgp_num parameters, copy the values from any existing Ceph pools, since these values are created from one source for all pools.

    For example:

    parameters:
      ceph:
        setup:
          pool:
            backups:
              pg_num: 8
              pgp_num: 8
              type: replicated
              application: rbd
    
  4. In /ceph/common.yml, add the profile rbd pool=backups OSD permission for the cinder user to the existing Ceph keyring OSD permissions for Cinder:

    parameters:
      ceph:
        common:
          keyring:
            cinder:
              caps:
                osd: "profile rbd pool=volumes, profile rbd-read-only pool=images, profile rbd pool=backups"
    
  5. In /openstack/init.yml, add the following definitions:

    parameters:
      _param:
        cinder_ceph_backup_pool: backups
        cinder_ceph_stripe_count: 0
        cinder_ceph_stripe_unit: 0
        cinder_ceph_backup_user: cinder
        cinder_ceph_chunk_size: 134217728
    
  6. In /openstack/control.yml, add the following classes:

    classes:
    - system.cinder.control.backup.ceph
    - system.cinder.volume.backup.ceph
    
  7. Update pillars and create a new pool:

    sudo salt '*' saltutil.refresh_pillar
    sudo salt 'cmn01*' state.apply ceph.setup
    
  8. Log in to the cmn01 node as a root user.

  9. Update the keyring for the cinder user. In the command below, substitute the --cap osd value with the one added to /ceph/common.yml in the step 4. For example:

    ceph-authtool /etc/ceph/ceph.client.cinder.keyring -n client.cinder \
     --cap osd 'profile rbd pool=volumes, profile rbd-read-only pool=images, profile rbd pool=backups' \
     --cap mon 'allow r, allow command \"osd blacklist\"'
    
  10. Apply the changes for Ceph:

    ceph auth import -i /etc/ceph/ceph.client.cinder.keyring
    
  11. Install the cinder-backup package and adjust the Cinder configuration:

    sudo salt 'ctl*' state.apply cinder
    

Now, you can proceed to Create a backup or restore Cinder volumes and snapshots.

Create a backup or restore Cinder volumes and snapshots

This section describes how to back up and restore Cinder volumes and snapshots in an OpenStack cluster. If your MCP version is earlier than 2019.2.4, before proceeding with the steps below, Manually enable the Ceph backup engine for Cinder.

Note

MCP sets the cinder-backup service to work with Ceph as a backup driver. The configuration parameters of the Cinder backup service are defined in the Salt formula. For details and configuration examples, see the SaltStack Cinder formula.

To back up a Cinder volume:

Create a full backup of a volume by running the following command on the Salt Master node:

salt -C 'I@keystone:client' cmd.run ". /root/keystonercv3; \
        openstack volume backup create --force <VOLUME_ID>"

Note

After a full backup is created, use the --incremental flag for further backups.

To back up a Cinder snapshot, run:

salt -C 'I@keystone:client' cmd.run ". /root/keystonercv3; \
        openstack volume backup create --force --snapshot <SNAPSHOT_ID> \
                                                            <VOLUME_ID>"

Note

For further backups, use the --incremental flag.

To restore Cinder volumes or snapshots, run:

salt -C 'I@keystone:client' cmd.run ". /root/keystonercv3; \
        openstack volume backup restore <BACKUP_ID> <VOLUME_ID>"
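
To find the <BACKUP_ID> of an existing backup, you can, for example, list the available backups first:

salt -C 'I@keystone:client' cmd.run ". /root/keystonercv3; \
        openstack volume backup list"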

Back up and restore OpenContrail

This section describes how to back up and restore Cassandra and ZooKeeper databases for OpenContrail 3.2 and 4.x.

Back up and restore a Cassandra database

MCP uses a custom script to back up Cassandra databases. Cassandra is part of every OpenContrail deployment and the backup utility includes the following components:

  • Cassandra server stores Cassandra backups rsynced from the Cassandra client nodes and runs on any node, for example, the Salt Master node.
  • Cassandra client sends database backups to the Cassandra server and runs on one node of the Cassandra cluster.

This section describes how to create and restore Cassandra databases for OpenContrail 3.2 and 4.x.

OpenContrail 3.2: Create a backup schedule for a Cassandra database

This section describes how to create a backup schedule for a Cassandra database for your OpenContrail 3.2 cluster.

To create a backup schedule for a Cassandra database:

  1. Log in to the Salt Master node.

  2. Configure the cassandra server role:

    1. Add the following class to cluster/infra/config.yml:

      classes:
      - system.cassandra.backup.server.single
      parameters:
        _param:
           cassandra_backup_public_key: <generate_your_keypair>
      

      By default, adding this include statement results in the Cassandra backup server keeping five full backups. To change the default setting, include the following pillar to cluster/infra/config.yml.

      parameters:
        cassandra:
          backup:
            cron: True
            server:
              enabled: true
              hours_before_full: 24
              full_backups_to_keep: 5
      
    2. Add the following lines to cluster/infra/config.yml:

      reclass:
        storage:
          node:
            opencontrail_control_node01:
              classes:
              - cluster.${_param:cluster_name}.opencontrail.control_init
      
  3. Configure the cassandra client role by adding the following lines to cluster/opencontrail/control_init.yml and specifying the SSH key pair (see the key generation sketch at the end of this step). Create this file if it does not exist.

    classes:
    - system.cassandra.backup.client.single
    parameters:
      _param:
        cassandra_remote_backup_server: cfg01
        root_private_key: |
          <generate_your_keypair>
    

    By default, adding this include statement results in Cassandra keeping three complete backups on the cassandra client node. The rsync command moves the backup files to the Salt Master node. To change the default setting, include the following pillar to cluster/opencontrail/control_init.yml:

    parameters:
      cassandra:
        backup:
          cron: True
          client:
            enabled: true
            cleanup_snaphots: true
            full_backups_to_keep: 3
            hours_before_full: 24
            target:
              host: cfg01
    

    Note

    • The target.host parameter must contain the resolvable hostname of the host where the cassandra server is running.
    • Mirantis recommends setting true for the cleanup_snaphots parameter so that the backup script removes all previous database snapshots once the current database backup is done.
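
    The <generate_your_keypair> placeholders stand for an SSH key pair that you generate yourself. A minimal sketch, with an illustrative file name:

    ssh-keygen -f ./cassandra_backup_key -t rsa -b 4096 -N ''

    Use the content of cassandra_backup_key.pub as the cassandra_backup_public_key value and the content of cassandra_backup_key as the root_private_key value.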
  4. If you customized the default parameters, verify that the hours_before_full parameter of the cassandra client in cluster/opencontrail/control_init.yml matches the same parameter of the cassandra server in cluster/infra/config.yml.

  5. Run the following command on the Salt Master node:

    salt '*' saltutil.refresh_pillar
    
  6. Apply the salt.minion state:

    salt -C 'I@cassandra:backup:client or I@cassandra:backup:server' state.sls salt.minion
    
  7. Refresh grains for the cassandra client node:

    salt -C 'I@cassandra:backup:client' saltutil.sync_grains
    
  8. Update the mine for the cassandra client node:

    salt -C 'I@cassandra:backup:client' mine.flush
    salt -C 'I@cassandra:backup:client' mine.update
    
  9. Apply the following state on the cassandra client nodes:

    salt -C 'I@cassandra:backup:client' state.sls openssh.client,cassandra.backup
    
  10. Apply the following state on the cassandra server nodes:

    salt -C 'I@cassandra:backup:server' state.sls cassandra
    
OpenContrail 3.2: Create an instant backup of a Cassandra database

After you create a backup schedule as described in OpenContrail 3.2: Create a backup schedule for a Cassandra database, you may also need to create an instant backup of a Cassandra database on your OpenContrail 3.2 cluster.

To create an instant backup of a Cassandra database:

  1. Verify that you have completed the steps described in OpenContrail 3.2: Create a backup schedule for a Cassandra database.

  2. Log in to the Salt Master node.

  3. Run the following state:

    salt-call state.sls reclass
    
  4. Log in to the OpenContrail control node that holds the Cassandra backup client role, for example, ntw01.

  5. Run the following script:

    /usr/local/bin/cassandra-backup-runner-call.sh
    

    Note

    The output can return an Error statement for a keyspace that does not contain any .db files. As long as this does not happen for all keyspaces, such behavior is normal.

  6. Verify that a complete backup has been created on the Cassandra backup client node:

    ls /var/backups/cassandra/full
    
  7. Log in to the Cassandra backup server node.

  8. Verify that the complete backup was rsynced to this node:

    ls /var/backups/cassandra/full
    
OpenContrail 3.2: Restore the Cassandra database

You may need to restore the Cassandra database after a hardware or software failure.

To restore the Cassandra database:

  1. Log in to the Salt Master node.

  2. Add the following lines to cluster/opencontrail/control_init.yml:

    cassandra:
      backup:
        client:
          enabled: true
          restore_latest: 1
          restore_from: remote
    

    where:

    • restore_latest can have, for example, the following values:
      • 1, which means restoring the database from the last complete backup.
      • 2, which means restoring the database from the second latest complete backup.
    • restore_from can have the local or remote values. The remote value uses scp to get the files from the cassandra server.
  3. Restore the Cassandra database using the Jenkins web UI:

    1. Add the upgrade pipeline to DriveTrain:

      1. Add the following lines to cluster/cicd/control/leader.yml:

        classes:
        - system.jenkins.client.job.deploy.update.restore_cassandra
        
      2. Run the salt -C 'I@jenkins:client' state.sls jenkins.client state.

    2. Log in to the Jenkins web UI.

    3. Open the cassandra - restore pipeline.

    4. Specify the following parameters:

      Parameter Description and values
      SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
      SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    5. Click Deploy.

OpenContrail 4.x: Create a backup schedule for a Cassandra database

This section describes how to create a backup schedule for a Cassandra database for your OpenContrail 4.x cluster.

To create a backup schedule for a Cassandra database:

  1. Log in to the Salt Master node.

  2. Configure the cassandra server role.

    By default, the Cassandra backup server keeps five full backups. You can change the default settings by specifying the following parameters in cluster/<cluster_name>/infra/backup/server.yml:

    parameters:
      _param:
      ...
      cassandra:
        backup:
          cron: true
          backup_dir: /srv/volumes/backup/cassandra
          server:
            enabled: true
            hours_before_full: 24
            full_backups_to_keep: 5
            key:
              cassandra_pub_key:
                enabled: true
                key: ${_param:cassandra_backup_public_key}
    
  3. Enable the scheduler backup process and configure the cassandra client role:

    1. Add the following parameters to cluster/<cluster_name>/opencontrail/control_init.yml:

      parameters:
        _param:
        ...
        cassandra:
          backup:
            cron: true
      

      By default, the Cassandra backup procedure keeps three complete backups on the cassandra client node. The rsync command moves the backup files to the Cassandra backup server.

    2. Optional. Change the default settings by specifying the following parameters in cluster/<cluster_name>/opencontrail/control_init.yml:

      parameters:
        _param:
        ...
        cassandra:
          backup:
            cron: true
            client:
              enabled: true
              cleanup_snaphots: true
              full_backups_to_keep: 5
              hours_before_full: 24
              ...
      

      Note

      Mirantis recommends setting true for the cleanup_snaphots parameter so that the backup script removes all previous database snapshots once the current database backup is done.

    3. Verify or add the specified host as the Cassandra backup server in cluster/<cluster_name>/infra/backup/client_cassandra.yml:

      parameters:
        _param:
          cassandra_remote_backup_server: ${_param:infra_kvm_node03_address}
      

      The cassandra_remote_backup_server parameter must contain the resolvable hostname of the host on which the cassandra server is running.

    4. Verify or add other options for the Cassandra client node in cluster/<cluster_name>/infra/backup/client_cassandra.yml:

      parameters:
        ...
        cassandra:
          backup:
            cron: true
            client:
              containers:
              - opencontrail_controller_1
              enabled: true
              full_backups_to_keep: 5
              hours_before_full: 24
              target:
                host: cfg01
                backup_dir: /srv/volumes/backup/cassandra/${linux:system:name}
              backup_times:
                hour: '2'
                minute: '0'
      
  4. If you customized the default parameters, verify that the hours_before_full parameter of the cassandra client in cluster/<cluster_name>/opencontrail/control_init.yml matches the same parameter of the cassandra server in cluster/<cluster_name>/infra/backup/server.yml.

  5. Refresh pillars on the target nodes:

    salt -C 'I@cassandra:backup' saltutil.refresh_pillar
    
  6. Apply the salt.minion state:

    salt -C 'I@cassandra:backup:client or I@cassandra:backup:server' state.sls salt.minion
    
  7. Refresh grains for the cassandra client node:

    salt -C 'I@cassandra:backup:client' saltutil.sync_grains
    
  8. Update the mine for the cassandra client node:

    salt -C 'I@cassandra:backup:client' mine.flush
    salt -C 'I@cassandra:backup:client' mine.update
    
  9. Apply the following state to add a Cassandra user:

    salt -C 'I@cassandra:backup:server' state.apply linux.system.user
    
  10. Apply the following state on the cassandra client nodes:

    salt -C 'I@cassandra:backup:client' state.apply openssh.client.private_key,cassandra.backup
    
  11. Apply the following state on the cassandra server nodes:

    salt -C 'I@cassandra:backup:server' state.sls cassandra
    
OpenContrail 4.x: Create an instant backup of a Cassandra database

After you create a backup schedule as described in OpenContrail 4.x: Create a backup schedule for a Cassandra database, you may also need to create an instant backup of a Cassandra database on your OpenContrail 4.x cluster.

To create an instant backup of a Cassandra database:

  1. Verify that you have completed the steps described in OpenContrail 4.x: Create a backup schedule for a Cassandra database.

  2. Log in to the Salt Master node.

  3. Run the following state:

    salt-call state.sls reclass
    
  4. Log in to the OpenContrail control node that holds the Cassandra backup client role, for example, ntw01.

  5. Run the following script:

    /usr/local/bin/cassandra-backup-runner-call.sh
    

    Note

    The output can return an Error statement for a keyspace that does not contain any .db files. As long as this does not happen for all keyspaces, such behavior is normal.

  6. Verify that a complete backup has been created on the Cassandra backup client node:

    ls /var/backups/cassandra/full
    
  7. Log in to the Cassandra backup server node.

  8. Verify that the complete backup was rsynced to this node:

    ls /srv/volumes/backup/cassandra/full
    
OpenContrail 4.x: Restore the Cassandra database

You may need to restore the Cassandra database after a hardware or software failure.

Warning

During the restore procedure, all current Cassandra data is deleted. Therefore, starting from the MCP 2019.2.5 maintenance update, a database backup in Cassandra is not created before the restore procedure. If a backup of current data is required, you can create an instant backup. For details, see: OpenContrail 4.x: Create an instant backup of a Cassandra database.

To restore the Cassandra database:

  1. Log in to the Salt Master node.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. Add the following snippet to cluster/<cluster_name>/infra/backup/client_cassandra.yml:

    cassandra:
      backup:
        client:
          enabled: true
          restore_latest: 1
          restore_from: remote
    

    where:

    • restore_latest can have, for example, the following values:
      • 1, which means restoring the database from the last complete backup.
      • 2, which means restoring the database from the second latest complete backup.
    • restore_from can have the local or remote values. The remote value uses scp to get the files from the cassandra server.
  4. Restore the Cassandra database using the Jenkins web UI:

    1. Verify that the following class is present in cluster/cicd/control/leader.yml:

      classes:
      - system.jenkins.client.job.deploy.update.restore_cassandra
      

      If you manually add this class, apply the changes:

      salt -C 'I@jenkins:client' state.sls jenkins.client
      
    2. Log in to the Jenkins web UI.

    3. Open the cassandra - restore pipeline.

    4. Specify the following parameters:

      Parameter Description and values
      SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
      SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    5. Click Deploy.

Back up and restore a ZooKeeper database

MCP uses a custom script to back up a ZooKeeper database. ZooKeeper is a part of every OpenContrail deployment and the backup utility includes the following components:

  • ZooKeeper server stores ZooKeeper backups rsynced from the ZooKeeper client nodes and runs on any node, for example, the Salt Master node.
  • ZooKeeper client sends database backups to the ZooKeeper server and always runs on all the nodes of the ZooKeeper cluster. However, the backup is performed only on the leader node.

This section describes how to back up and restore a ZooKeeper database on your OpenContrail 3.2 or 4.x cluster.

OpenContrail 3.2: Create a backup schedule for a ZooKeeper database

This section describes how to create a backup schedule for a ZooKeeper database on an OpenContrail 3.2 cluster.

To create a backup schedule for a ZooKeeper database:

  1. Log in to the Salt Master node.

  2. Configure the zookeeper server role by adding the following class to cluster/infra/config.yml:

    classes:
    - system.zookeeper.backup.server.single
    parameters:
      _param:
         zookeeper_backup_public_key: <generate_your_keypair>
    

    By default, adding this include statement results in the ZooKeeper backup server keeping five full backups. To change the default setting, include the following pillar to cluster/infra/config.yml.

    parameters:
      zookeeper:
        backup:
          server:
            enabled: true
            hours_before_full: 24
            full_backups_to_keep: 5
    
  3. Configure the zookeeper client role by adding the following lines to cluster/opencontrail/control.yml:

    classes:
    - system.zookeeper.backup.client.single
    parameters:
      _param:
        zookeeper_remote_backup_server: cfg01
        root_private_key: |
          <generate_your_keypair>
    

    By default, adding this include statement results in ZooKeeper keeping three complete backups on the zookeeper client node. The rsync command moves the backup files to the Salt Master node. To change the default setting, include the following pillar to cluster/opencontrail/control.yml:

    parameters:
      zookeeper:
        backup:
          client:
            enabled: true
            full_backups_to_keep: 3
            hours_before_full: 24
            target:
              host: cfg01
    

    Note

    The target.host parameter must contain the resolvable hostname of the host where the zookeeper server is running.

  4. If you customized the default parameters, verify that the hours_before_full parameter of the zookeeper client in cluster/opencontrail/control.yml matches the same parameter of the zookeeper server in cluster/infra/config.yml.

  5. Run the following command on the Salt Master node:

    salt '*' saltutil.refresh_pillar
    
  6. Apply the salt.minion state:

    salt -C 'I@zookeeper:backup:client or I@zookeeper:backup:server' state.sls salt.minion
    
  7. Refresh grains for the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' saltutil.sync_grains
    
  8. Update the mine for the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' mine.flush
    salt -C 'I@zookeeper:backup:client' mine.update
    
  9. Apply the following state for the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' state.sls openssh.client,zookeeper.backup
    
  10. Apply the following states for the zookeeper server node:

    salt -C 'I@zookeeper:backup:server' state.apply linux.system
    salt -C 'I@zookeeper:backup:server' state.sls zookeeper.backup
    
OpenContrail 3.2: Create an instant backup of ZooKeeper

After you create a backup schedule as described in OpenContrail 3.2: Create a backup schedule for a ZooKeeper database, you may also need to create an instant backup of a ZooKeeper database on an OpenContrail 3.2 cluster.

To create an instant backup of a ZooKeeper database:

  1. Verify that you have completed the steps described in OpenContrail 3.2: Create a backup schedule for a ZooKeeper database.

  2. Find out which OpenContrail control node is the ZooKeeper leader:

    salt -C 'I@opencontrail:control' cmd.run 'echo stat | nc localhost 2181 | grep leader'
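
    The node whose output contains the Mode: leader line is the current leader; the other nodes return no matching output. Example of system response, with illustrative node names:

    ntw01.example.local:
        Mode: leader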
    
  3. Log in to the OpenContrail control leader node, for example, ntw01.

  4. Run the following script:

    /usr/local/bin/zookeeper-backup-runner.sh
    
  5. Verify that a complete backup has been created on the OpenContrail control leader node:

    ls /var/backups/zookeeper/full
    
  6. Log in to the zookeeper server node and verify that the complete backup was rsynced to this node by executing the following command:

    ls /var/backups/zookeeper/full
    
OpenContrail 3.2: Restore a ZooKeeper database

You may need to restore a ZooKeeper database after a hardware or software failure.

To restore a ZooKeeper database:

  1. Log in to the Salt Master node.

  2. Add the following lines to cluster/opencontrail/control.yml:

    zookeeper:
      backup:
        client:
          enabled: true
          restore_latest: 1
          restore_from: remote
    

    where:

    • restore_latest can have, for example, the following values:
      • 1, which means restoring the database from the last complete backup.
      • 2, which means restoring the database from the second latest complete backup.
    • restore_from can have the local or remote values. The remote value uses scp to get the files from the zookeeper server.
  3. Proceed either with automatic restore steps using the Jenkins web UI pipeline or with manual restore steps:

    • Automatic restore steps:

      1. Add the upgrade pipeline to DriveTrain:

        1. Add the following lines to cluster/cicd/control/leader.yml:

          classes:
          - system.jenkins.client.job.deploy.update.restore_zookeeper
          
        2. Run the salt -C 'I@jenkins:client' state.sls jenkins.client state.

      2. Log in to the Jenkins web UI.

      3. Open the zookeeper - restore pipeline.

      4. Specify the following parameters:

        Parameter Description and values
        SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
        SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
      5. Click Deploy.

    • Manual restore steps:

      1. Stop the supervisor-config service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' service.stop supervisor-config
        
      2. Stop the supervisor-control service on the OpenContrail control nodes:

        salt -C 'I@opencontrail:control' service.stop supervisor-control
        
      3. Stop the zookeeper service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' service.stop zookeeper
        
      4. Remove the ZooKeeper files on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'rm -rf /var/lib/zookeeper/version-2/*'
        
      5. Run the zookeeper state.

        Mirantis recommends running the following command using Linux GNU Screen or alternatives.

        salt -C 'I@opencontrail:control' cmd.run "su root -c 'salt-call state.sls zookeeper'"
        

        This state restores the databases and creates a file in /var/backups/zookeeper/dbrestored.

        Caution

        If you rerun the state, it will not restore the database again. To repeat the restore procedure, first delete the /var/backups/zookeeper/dbrestored file and then rerun the zookeeper state.

      6. Start the zookeeper service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' service.start zookeeper
        
      7. Start the supervisor-config service on the OpenContrail control nodes:

        salt -C 'I@opencontrail:control' service.start supervisor-config
        

      8. Start the supervisor-control service on the OpenContrail control nodes:

        salt -C 'I@opencontrail:control' service.start supervisor-control
        
      9. Start the zookeeper service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' service.start zookeeper
        
      10. Verify that the ZooKeeper files are present again on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'ls /var/lib/zookeeper/version-2'
        
      11. Wait 60 seconds and verify that OpenContrail is in the correct state on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'contrail-status'
        
OpenContrail 4.x: Create a backup schedule for a ZooKeeper database

This section describes how to create a backup schedule for a ZooKeeper database on an OpenContrail 4.x cluster.

To create a backup schedule for a ZooKeeper database:

  1. Log in to the Salt Master node.

  2. Configure the zookeeper server role using the following parameters in cluster/<cluster_name>/infra/backup/server.yml:

    parameters:
      zookeeper:
        backup:
          cron: true
          server:
            enabled: true
            hours_before_full: 24
            full_backups_to_keep: 5
            backup_dir: /srv/volumes/backup/zookeeper
    

    By default, the ZooKeeper backup server keeps five full backups.

  3. Configure the zookeeper client role by adding the following parameters to cluster/<cluster_name>/infra/backup/client_zookeeper.yml:

    parameters:
      _param:
        zookeeper_remote_backup_server: ${_param:infra_kvm_node03_address}
      zookeeper:
        backup:
          cron: True
          client:
            enabled: true
            full_backups_to_keep: 3
            hours_before_full: 24
            containers:
            - opencontrail_controller_1
    

    Caution

    The zookeeper_remote_backup_server parameter must contain the resolvable hostname of the host on which the zookeeper server is running.

    The ZooKeeper backup process keeps three complete backups on the zookeeper client node. The rsync command moves the backup files to the ZooKeeper server node.

  4. If you customized the default parameters, verify that the hours_before_full parameter of the zookeeper client in cluster/<cluster_name>/infra/backup/client_zookeeper.yml matches the same parameter of the zookeeper server in cluster/<cluster_name>/infra/backup/server.yml.

  5. Run the following command on the Salt Master node:

    salt '*' saltutil.refresh_pillar
    
  6. Apply the salt.minion state:

    salt -C 'I@zookeeper:backup:client or I@zookeeper:backup:server' state.sls salt.minion
    
  7. Refresh grains for the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' saltutil.sync_grains
    
  8. Update the mine for the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' mine.flush
    salt -C 'I@zookeeper:backup:client' mine.update
    
  9. Apply the following state on the zookeeper client node:

    salt -C 'I@zookeeper:backup:client' state.sls openssh.client,zookeeper.backup
    
  10. Apply the following state on the zookeeper server node:

    salt -C 'I@zookeeper:backup:server' state.sls zookeeper.backup
    
OpenContrail 4.x: Create an instant backup of ZooKeeper

After you create a backup schedule as described in OpenContrail 4.x: Create a backup schedule for a ZooKeeper database, you may also need to create an instant backup of a ZooKeeper database on an OpenContrail 4.x cluster.

To create an instant backup of a ZooKeeper database:

  1. Verify that you have completed the steps described in OpenContrail 4.x: Create a backup schedule for a ZooKeeper database.

  2. Find out which OpenContrail control node is the ZooKeeper leader:

    salt -C 'I@opencontrail:control' cmd.run 'echo stat | nc localhost 2181 | grep leader'
    
  3. Log in to the OpenContrail control leader node, for example, ntw01.

  4. Run the following script:

    /usr/local/bin/zookeeper-backup-runner.sh
    
  5. Verify that a complete backup has been created on the OpenContrail control leader node:

    ls /var/backups/zookeeper/full
    
  6. Log in to the zookeeper server node and verify that the complete backup was rsynced to this node by executing the following command:

    ls /srv/volumes/backup/zookeeper/full
    
OpenContrail 4.x: Restore a ZooKeeper database

You may need to restore a ZooKeeper database after a hardware or software failure.

To restore a ZooKeeper database:

  1. Log in to the Salt Master node.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. Add the following snippet to cluster/<cluster_name>/infra/backup/client_zookeeper.yml:

    zookeeper:
      backup:
        client:
          enabled: true
          restore_latest: 1
          restore_from: remote
    

    where:

    • restore_latest can have, for example, the following values:
      • 1, which means restoring the database from the last complete backup.
      • 2, which means restoring the database from the second latest complete backup.
    • restore_from can have the local or remote values. The remote value uses scp to get the files from the zookeeper server.
  4. Proceed either with automatic restore steps using the Jenkins web UI pipeline or with manual restore steps:

    • Automatic restore steps:

      1. Verify that the following class is present in cluster/cicd/control/leader.yml:

        classes:
        - system.jenkins.client.job.deploy.update.restore_zookeeper
        

        If you manually add this class, apply the changes:

        salt -C 'I@jenkins:client' state.sls jenkins.client
        
      2. Log in to the Jenkins web UI.

      3. Open the zookeeper - restore pipeline.

      4. Specify the following parameters:

        Parameter Description and values
        SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
        SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
      5. Click Deploy.

    • Manual restore steps:

      1. Stop the config services on the OpenContrail control nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-api'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-schema'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-svc-monitor'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-device-manager'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-config-nodemgr'
        
      2. Stop the control services on the OpenContrail control nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-control'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-named'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-dns'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl stop contrail-control-nodemgr'
        
      3. Stop the zookeeper service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller service zookeeper stop'
        
      4. Remove the ZooKeeper files from the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'rm -rf /var/lib/config_zookeeper_data/version-2/*'
        
      5. Run the zookeeper state.

        Mirantis recommends running the following command using Linux GNU Screen or alternatives.

        salt -C 'I@opencontrail:control' state.apply zookeeper.backup
        

        This state restores the databases and creates a file in /var/backups/zookeeper/dbrestored.

        Caution

        If you rerun the state, it will not restore the database again. To repeat the restore procedure, first delete the /var/backups/zookeeper/dbrestored file and then rerun the zookeeper state.

      6. Verify that the ZooKeeper files are present again on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller ls /var/lib/zookeeper/version-2'
        
      7. Start the zookeeper service on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller service zookeeper start'
        
      8. Start the config services on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-api'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-schema'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-svc-monitor'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-device-manager'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-config-nodemgr'
        
      9. Start the control services on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-control'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-named'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-dns'
        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller systemctl start contrail-control-nodemgr'
        
      10. Wait 60 seconds and verify that OpenContrail is in the correct state on the OpenContrail controller nodes:

        salt -C 'I@opencontrail:control' cmd.run 'doctrail controller contrail-status'
        

Back up and restore a MAAS PostgreSQL database

The Mirantis Cloud Platform (MCP) uses the PostgreSQL database to store all state information for the MAAS server provisioning tool of MCP. Mirantis recommends backing up your MAAS PostgreSQL databases daily to ensure the integrity of your data. You may also need to create a database backup for testing purposes.

MCP uses the Backupninja utility to back up and restore the MAAS PostgreSQL database. You can also use Backupninja to copy the MAAS PostgreSQL databases.

Backupninja installs automatically with your cloud environment using a SaltStack formula and includes the following components:

  • Backupninja server collects requests from the Backupninja client nodes and runs on any node, for example, the Salt Master node.
  • Backupninja client sends database backups to the Backupninja server and runs on the database cluster nodes.

Backupninja collects and backs up all the MAAS PostgreSQL database data to successfully restore the system from scratch.

This section describes how to back up and restore your MAAS PostgreSQL databases using Backupninja.

Back up a MAAS PostgreSQL database prior to 2019.2.5

This section describes how to back up a MAAS PostgreSQL database prior to the MCP 2019.2.5 maintenance update.

Enable a backup schedule for a MAAS PostgreSQL database using Backupninja

This section describes how to create a backup schedule for a MAAS PostgreSQL database using the Backupninja utility. By default, Backupninja runs daily at 1:00 AM.

To create a backup schedule for a MAAS PostgreSQL database:

  1. Log in to the Salt Master node.

  2. Add the following lines to cluster/infra/maas.yml:

    classes:
    - system.backupninja.client.single
    - system.openssh.client.root
    parameters:
      _param:
        backupninja_backup_host: <IP>
        root_private_key: |
          <generate_your_keypair>
    

    Note

    The backupninja_backup_host parameter specifies the node that runs the backupninja server, for example, the Salt Master node.

  3. Include the following pillar to the node that runs the backupninja server and will store the database backups remotely:

    classes:
    - system.backupninja.server.single
    parameters:
      _param:
         backupninja_public_key: <generate_your_keypair>
    
  4. Optionally, override the default configuration of the backup schedule as described in Configure a backup schedule for a MAAS PostgreSQL database.

  5. Apply the salt.minion state:

    salt -C 'I@backupninja:client or I@backupninja:server' state.sls salt.minion
    
  6. Refresh grains for the backupninja client node:

    salt -C 'I@backupninja:client' saltutil.sync_grains
    
  7. Update the mine for the backupninja client node:

    salt -C 'I@backupninja:client' mine.flush
    salt -C 'I@backupninja:client' mine.update
    
  8. Apply the backupninja state on the backupninja client node:

    salt -C 'I@backupninja:client' state.sls backupninja
    

    Applying this state creates two backup configuration scripts. By default, the scripts are stored in the /etc/backup.d/ directory and will run daily at 1:00 AM.
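
    For example, you can list the generated configuration scripts on the client node:

    salt -C 'I@backupninja:client' cmd.run 'ls /etc/backup.d/'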

  9. Refresh grains for the backupninja server node:

    salt -C 'I@backupninja:server' state.sls salt.minion.grains
    
  10. Apply the backupninja state on the backupninja server node:

    salt -C 'I@backupninja:server' state.sls backupninja
    
Configure a backup schedule for a MAAS PostgreSQL database

By default, Backupninja runs daily at 1:00 AM. This section describes how to override the default configuration of the backup schedule. To enable the backup schedule in your deployment, refer to Enable a backup schedule for a MAAS PostgreSQL database using Backupninja.

To configure a backup schedule for a MAAS PostgreSQL database:

  1. Log in to the Salt Master node.

  2. Edit the cluster/infra/maas.yml file as required. Select from the following options:

    • Set the exact time of the backup server role to override the default backup time. The backup_times parameters include:

      • day_of_week

        The day of a week to perform backups. Specify 0 for Sunday, 1 for Monday, and so on. If not set, defaults to *.

      • day_of_month

        The day of a month to perform backups. For example, 20, 25, and so on. If not set, defaults to *.

        Note

        Only one of day_of_week or day_of_month can be active at a time. If both are defined, day_of_week is prioritized.

      • hour

        The hour to perform backups. Uses the 24 hour format. If not defined, defaults to 1.

      • minute

        The minute to perform backups. For example, 5, 10, 59, and so on. If not defined, defaults to 00.

      Configuration example:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              backup_times:
                day_of_week: 1
                hour: 15
                minute: 45
      

      Note

      These settings change the global Backupninja schedule. Unless overridden for individual steps, Backupninja runs all steps in the correct order. This is the recommended way to define the exact backup order.

    • Disable automatic backups:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              auto_backup_disabled: true
      
    • Re-enable automatic backups by setting the auto_backup_disabled parameter to false or delete the related line in cluster/infra/maas.yml:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              auto_backup_disabled: false
      
  3. Apply the changes by performing steps 5-10 of the Enable a backup schedule for a MAAS PostgreSQL database using Backupninja procedure.

Back up and restore a MAAS PostgreSQL database starting from 2019.2.5

This section describes how to back up and restore a MAAS PostgreSQL database starting from the MCP 2019.2.5 maintenance update.

Enable a backup schedule for a MAAS PostgreSQL database using Backupninja

This section describes how to create a backup schedule for a MAAS PostgreSQL database using the Backupninja utility. By default, Backupninja runs daily at 1:00 AM.

To create a backup schedule for a MAAS PostgreSQL database:

  1. Log in to the Salt Master node.

  2. Add the following lines to cluster/infra/maas.yml:

    classes:
    - system.backupninja.client.single
    - system.openssh.client.root
    parameters:
      _param:
        backupninja_backup_host: <IP>
        root_private_key: |
          <generate_your_keypair>
    

    Note

    The backupninja_backup_host parameter is the IP address or the host name of a node running the backupninja server. To obtain this IP address or host name, run salt -C 'I@backupninja:server' grains.item fqdn_ip4 or salt -C 'I@backupninja:server' grains.item fqdn respectively.

  3. Verify that root_private_key exists:

    salt-call pillar.get '_param:root_private_key'
    
  4. If the root_private_key parameter is missing, include the following pillar to the node that runs the backupninja server and will store the database backups remotely:

    classes:
    - system.backupninja.server.single
    parameters:
      _param:
         backupninja_public_key: <generate_your_keypair>
    
  5. Optionally, override the default configuration of the backup schedule as described in Configure a backup schedule for a MAAS PostgreSQL database.

  6. Apply the salt.minion state:

    salt -C 'I@backupninja:client or I@backupninja:server' state.sls salt.minion
    
  7. Refresh grains for the backupninja client node:

    salt -C 'I@backupninja:client' saltutil.sync_grains
    
  8. Update the mine for the backupninja client node:

    salt -C 'I@backupninja:client' mine.update
    
  9. Apply the backupninja state on the backupninja client node:

    salt -C 'I@backupninja:client' state.sls backupninja
    

    Applying this state creates two backup configuration scripts. By default, the scripts are stored in the /etc/backup.d/ directory and will run daily at 1:00 AM.

  10. Refresh grains for the backupninja server node:

    salt -C 'I@backupninja:server' state.sls salt.minion.grains
    
  11. Apply the backupninja state on the backupninja server node:

    salt -C 'I@backupninja:server' state.sls backupninja
    
Configure a backup schedule for a MAAS PostgreSQL database

Warning

This configuration presupposes manual backups or backups performed by a cronjob. If you use the Backupninja backup pipeline job, see Configure the Backupninja backup pipeline.

By default, Backupninja runs daily at 1:00 AM. This section describes how to override the default configuration of the backup schedule. To enable the backup schedule in your deployment, refer to Enable a backup schedule for a MAAS PostgreSQL database using Backupninja.

To configure a backup schedule for a MAAS PostgreSQL database:

  1. Log in to the Salt Master node.

  2. Edit the cluster/infra/maas.yml file as required. Select from the following options:

    • Set the exact time of the backup server role to override the default backup time. The backup_times parameters include:

      • day_of_week

        The day of a week to perform backups. Specify 0 for Sunday, 1 for Monday, and so on. If not set, defaults to *.

      • day_of_month

        The day of a month to perform backups. For example, 20, 25, and so on. If not set, defaults to *.

        Note

        Only one of day_of_week or day_of_month can be active at a time. If both are defined, day_of_week is prioritized.

      • hour

        The hour to perform backups. Uses the 24 hour format. If not defined, defaults to 1.

      • minute

        The minute to perform backups. For example, 5, 10, 59, and so on. If not defined, defaults to 00.

      Configuration example:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              backup_times:
                day_of_week: 1
                hour: 15
                minute: 45
      

      Note

      These settings change the global Backupninja schedule. Unless overridden for individual steps, Backupninja runs all steps in the correct order. This is the recommended way to define the exact backup order.

    • Disable automatic backups:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              auto_backup_disabled: true
      
    • Re-enable automatic backups by setting the auto_backup_disabled parameter to false or delete the related line in cluster/infra/maas.yml:

      parameters:
        _param:
          backupninja:
            enabled: true
            client:
              auto_backup_disabled: false
      
  3. Apply the changes by performing steps 6-11 of the Enable a backup schedule for a MAAS PostgreSQL database using Backupninja procedure.

Create an instant backup of a MAAS PostgreSQL database

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

After you create a backup schedule, you may also need to create an instant backup of a MAAS PostgreSQL database.

To create an instant backup of a MAAS PostgreSQL database using the Backupninja service:

  1. Verify that you have completed the steps described in Enable a backup schedule for a MAAS PostgreSQL database using Backupninja.

  2. Select one of the following options:

    • Create an instant backup of a MAAS PostgreSQL database automatically as described in Create an instant backup using Backupninja pipeline.

    • Create an instant backup of a MAAS PostgreSQL database manually:

      1. Verify that you have the postgresql-client-9.6 package installed on the Salt Master node. Otherwise, install it:

        salt -C 'I@backupninja:client' pkg.install postgresql-client,postgresql-client-9.6
        
      2. Verify that you have completed the steps described in Enable a backup schedule for a MAAS PostgreSQL database using Backupninja.

      3. Log in to the MAAS node, for example, cfg01.

      4. Save the MAAS file permissions to file_permissions.txt:

        which getfacl && getfacl -pR /var/lib/maas/ > /var/lib/maas/file_permissions.txt
        
      5. Compress all MAAS PostgreSQL databases and store them in /var/backups/postgresql/:

        backupninja -n --run /etc/backup.d/102.pgsql
        
      6. Move the local backup files to the backupninja server using rsync:

        backupninja -n --run /etc/backup.d/200.backup.rsync
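
        Optionally, verify that the compressed database dumps were created locally in /var/backups/postgresql/, for example:

        ls -lh /var/backups/postgresql/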
        
Restore the MAAS PostgreSQL database using Backupninja

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

You may need to restore a MAAS PostgreSQL database after a hardware or software failure or if you want to create a clone of the existing database from a backup. This section instructs you on how to run a restore of a MAAS PostgreSQL database using the Backupninja service.

To restore a MAAS PostgreSQL database using the Backupninja service:

Select one of the following options:

  • Restore the MAAS PostgreSQL database automatically as described in Restore the services using Backupninja pipeline.

  • Restore the MAAS PostgreSQL database manually:

    1. Log in to the Salt Master node.

    2. Include the following pillar to cluster/infra/maas.yml to restore the MAAS PostgreSQL database from any database node:

      classes:
      - system.maas.region.single
      - system.maas.region.restoredb
      parameters:
         _param:
           backupninja_backup_host: <IP>
      
    3. Apply the maas.region state to the corresponding nodes with MAAS.

      Mirantis recommends running the following command using Linux GNU Screen or alternatives.

      salt -C 'I@maas:region' state.sls maas.region
      

      Running this state restores the databases and creates a file for every restored database in /root/maas/flags.

      Caution

      If you rerun the state, it will not restore the database again. To repeat the restore procedure for any database, first delete the database file from /root/maas/flags and then rerun the maas.region state.

Back up and restore the Dogtag server files and database

This section describes how to back up and restore the Dogtag server files and database.

Back up the Dogtag server files and database prior to 2019.2.6

This section describes how to back up the Dogtag server files and database prior to the MCP 2019.2.6 maintenance update.

Prepare for the Dogtag backup

This section describes how to prepare your environment for the backup of the Dogtag server files and database.

To prepare your environment for the Dogtag backup:

  1. Log in to the Salt Master node.

  2. Open the cluster Reclass level of your deployment.

  3. In openstack/barbican.yml, add the following configuration:

    classes:
    - system.openssh.client.root
    parameters:
      backupninja:
        client:
          enabled: false
    
  4. Apply the openssh state on the nodes with Dogtag to create the private key:

    salt -C 'I@dogtag:server' state.sls openssh
    
  5. Verify that the private key has been created with correct permissions:

    salt -C 'I@dogtag:server' cmd.run 'ls -la /root/.ssh'
    

    The file with the following permissions should be present:

    -r-------- 1 root root <DATE> <TIME> id_rsa
    
  6. Update authenticated hosts on the backup server node:

    salt -C '*' state.sls salt.minion.grains
    salt -C '*' mine.update
    salt -C 'I@backupninja:server' state.sls backupninja
    
  7. Verify that the authorized_keys file on the backup server node contains the IP addresses for the kmn nodes:

    salt -C 'I@backupninja:server' cmd.run "su -l backupninja -c 'cat ~/.ssh/authorized_keys'"
    

    Example of system response:

    no-pty,from="10.11.0.67,10.11.0.66,10.11.0.15,10.11.0.65" ssh-rsa AAAAB3NzaC.....6tS7iqfkZ
    

    Note

    Note the IP addresses from the from parameter.

Now, you can proceed with Back up the Dogtag server files and database.

Back up the Dogtag server files and database

After you complete Prepare for the Dogtag backup, you can perform the Dogtag backup.

Warning

The content of a Dogtag database is tightly connected with the content of a Barbican database running on the Galera cluster. Therefore, we recommend running backups of Dogtag and Barbican simultaneously.

To back up the Dogtag server files and database:

  1. Log in to the Salt Master node.

  2. Obtain the host name of the remote server node:

    salt -C 'I@backupninja:server' pillar.get "linux:network:fqdn"
    
  3. Create a backup directory:

    salt -C 'I@dogtag:server and *01*' cmd.run "mkdir -p /var/backups/dogtag"
    
  4. Export the signing certificate and key:

    1. Export the credentials. For example:

      salt -C 'I@dogtag:server and *01*' cmd.run "grep internal= /var/lib/pki/pki-tomcat/conf/password.conf | awk -F= '{print \$2}' > /etc/dogtag/internal.txt"
      salt -C 'I@dogtag:server and *01*' cmd.run "grep internaldb= /var/lib/pki/pki-tomcat/conf/password.conf | awk -F= '{print \$2}' > /etc/dogtag/pass.txt"
      
    2. Export the certificate. For example:

      salt -C 'I@dogtag:server and *01*' cmd.run "PKCS12Export -debug -d /var/lib/pki/$PKINAME/alias -p /etc/dogtag/internal.txt -o /etc/dogtag/ca-certs.p12 -w /etc/dogtag/pass.txt"
      
    3. Remove the internal password file:

      salt -C 'I@dogtag:server and *01*' cmd.run "rm -f /etc/dogtag/internal.txt"
      
  5. Export CSR:

    salt -C 'I@dogtag:server and *01*' cmd.run "echo '-----BEGIN NEW CERTIFICATE REQUEST-----' > /etc/dogtag/ca_signing.csr"
    salt -C 'I@dogtag:server and *01*' cmd.run "sed -n '/^ca.signing.certreq=/ s/^[^=]*=// p' < /var/lib/pki/pki-tomcat/ca/conf/CS.cfg >> /etc/dogtag/ca_signing.csr"
    salt -C 'I@dogtag:server and *01*' cmd.run "echo '-----END NEW CERTIFICATE REQUEST-----' >> /etc/dogtag/ca_signing.csr"
    
  6. Run the database backup:

    salt -C 'I@dogtag:server and *01*' cmd.run "/usr/sbin/db2bak-online -Z pki-tomcat -j /etc/dogtag/pass.txt -A /var/lib/dirsrv/slapd-pki-tomcat/bak"
    

    Note the backup directory from the system response. For example:

    Back up directory: /var/lib/dirsrv/slapd-pki-tomcat/bak/pki-tomcat-2019_8_14_11_28_25
    
  7. Remove the Dogtag password file:

    salt -C 'I@dogtag:server and *01*' cmd.run "rm -f /etc/dogtag/pass.txt"
    
  8. Create a .tar archive that contains the Dogtag server files. For example:

    salt -C 'I@dogtag:server and *01*' cmd.run "tar czvf /var/backups/dogtag/dogtag_backup-$(date +%F_%H-%M-%S).tar.gz --ignore-failed-read -C / \
    etc/pki/pki-tomcat \
    etc/dogtag/ca-certs.p12 \
    etc/dogtag/ca_signing.csr \
    etc/sysconfig/pki-tomcat \
    etc/sysconfig/pki/tomcat/pki-tomcat \
    etc/systemd/system/pki-tomcatd.target.wants/pki-tomcatd@pki-tomcat.service \
    var/lib/pki/pki-tomcat \
    var/log/pki/pki-tomcat \
    usr/share/pki/server/conf/database.conf \
    usr/share/pki/server/conf/schema.conf"
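
    Optionally, verify that the archive was created before transferring it, for example:

    salt -C 'I@dogtag:server and *01*' cmd.run "ls -lh /var/backups/dogtag"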
    
  9. Create the remote backup directory. For example:

    salt -C 'I@backupninja:server' cmd.run "mkdir -p /srv/volumes/backup/backupninja/dogtag"
    
  10. Transfer the data to the remote node. For example:

    salt -C 'I@dogtag:server and *01*' cmd.run "/usr/bin/rsync -rhtPpv --rsync-path=rsync --progress /var/backups/dogtag/* -e ssh backupninja@<remote_node>:/srv/volumes/backup/backupninja/dogtag"
    

    Note

    The command above transfers all backups from the backup directory, not only the latest one, skipping the files that are already present on the remote node.

  11. Verify that the file transfer has been completed:

    salt -C 'I@backupninja:server' cmd.run "ls -l /srv/volumes/backup/backupninja/dogtag"
    

    The backup directory should include the dogtag_backup-<some_timestamp>.tar.gz file.

Back up and restore the Dogtag server files and database starting from 2019.2.6

This section describes how to back up the Dogtag server files and database starting from the MCP 2019.2.6 maintenance update. MCP uses the Backupninja utility to back up the Dogtag server files and database.

Note

The Dogtag restore procedure is available starting from the 2019.2.11 maintenance update.

Prepare for the Dogtag backup and restore

This section describes how to prepare your environment for the backup and restore of the Dogtag server files and database.

To prepare for the Dogtag backup and restore:

  1. Log in to the Salt Master node.

  2. Open the cluster Reclass level of your deployment.

  3. Verify that the cluster/<cluster_name>/infra/backup/client_dogtag.yml file with the Backupninja configuration for Dogtag exists. Otherwise, create this file with the following content:

    classes:
    - system.backupninja.client.single
    - cluster.<cluster_name>.infra.backup.allow_root_ssh
    parameters:
      _param:
        backupninja_backup_host: ${_param:infra_kvm_node03_address}
      backupninja:
        client:
          target:
            home_dir: /srv/volumes/backup/backupninja
            engine: rsync
            engine_opts: "-av --delete --recursive --safe-links"
          log:
            color: ${_param:backupninja_color_log}
    
  4. In openstack/barbican.yml, include the client_dogtag class if it is not present:

    - cluster.<cluster_name>.infra.backup.client_dogtag
    
  5. Refresh grains:

    salt -C '*' state.sls salt.minion.grains
    salt -C '*' mine.update
    
  6. Apply the openssh state on the Dogtag server nodes:

    salt -C 'I@dogtag:server and I@backupninja:client' state.sls openssh
    
  7. Apply the backupninja state on the Dogtag server nodes:

    salt -C 'I@dogtag:server and I@backupninja:client' state.sls backupninja
    
  8. Apply the backupninja state on the Backupninja server node:

    salt -C 'I@backupninja:server' state.sls backupninja
    

Now, you can proceed with Back up the Dogtag server files and database.

Back up the Dogtag server files and database

After you complete Prepare for the Dogtag backup and restore, you can perform the Dogtag backup.

Warning

The content of a Dogtag database is tightly connected with the content of a Barbican database running on a Galera cluster. Therefore, we recommend running backups of Dogtag and Barbican simultaneously. For this reason, the Backupninja backup pipeline that backs up Dogtag is always triggered once the backup of a Galera cluster performed by the Galera database backup pipeline is finished.

To back up the Dogtag server files and database:

  1. Verify that you have completed the steps described in Prepare for the Dogtag backup and restore.

  2. Select one of the following options:

    • Create an instant backup of the Dogtag files and database automatically as described in Create an instant backup using Backupninja pipeline.

    • Create an instant backup of the Dogtag files and database manually:

      1. Log in to the Salt Master node.

      2. Run the backup script:

        salt -C 'I@dogtag:server:role:master' cmd.run "su root -c 'backupninja --now -d'"
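
        You can then check that the backup files appeared on the Backupninja server node under the home_dir defined in client_dogtag.yml, for example:

        salt -C 'I@backupninja:server' cmd.run "ls -l /srv/volumes/backup/backupninja"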
        
Restore the Dogtag server files and database

After you complete the Back up the Dogtag server files and database procedure, you can perform the Dogtag restore.

Note

  • The Dogtag restore procedure is available starting from the 2019.2.11 maintenance update.
  • If Dogtag restore has already been performed on your cluster, first delete the .dogtag_restored file in /etc/salt.

To restore the Dogtag server files and database:

  1. Verify that you have completed the steps described in Prepare for the Dogtag backup and restore.

  2. Select from the following options:

    • Restore Dogtag files and database automatically as described in Restore the services using Backupninja pipeline.

    • Restore Dogtag files and database manually:

      1. Log in to the Salt Master node.

      2. In cluster/openstack/barbican.yml, configure the following pillar:

        parameters:
          dogtag:
            server:
              initial_data:
                engine: rsync
                source: ${_param:backupninja_backup_host}
                host: ${_param:openstack_barbican_node01_hostname}.${_param:cluster_domain}
                home_dir: '/path/to/backups/' # defaults to '/srv/volumes/backup/backupninja' if not defined
        
      3. Stop the database service on secondary nodes to temporarily disable replication:

        salt -C 'I@dogtag:server:role:slave' service.stop 'dirsrv@pki-tomcat.service'
        
      4. Apply the restore state.

        Mirantis recommends running the following command using Linux GNU Screen or alternatives.

        salt -C 'I@dogtag:server:role:master' state.sls dogtag.server.restore
        
      5. Start the database service on secondary nodes to enable replication:

        salt -C 'I@dogtag:server:role:slave' service.start 'dirsrv@pki-tomcat.service'
        
      6. Verify that all services are running:

        salt -C 'I@dogtag:server' service.status "dirsrv@pki-tomcat.service"
        

        Expected system response:

        kmn02.example.domain.local:
          True
        kmn03.example.domain.local:
          True
        kmn01.example.domain.local:
          True
        
      7. Apply the states below in the following strict order:

        salt -C 'I@salt:master' state.sls salt,reclass
        salt -C 'I@dogtag:server and *01*' state.sls dogtag.server
        salt -C 'I@dogtag:server' state.sls dogtag.server
        salt -C 'I@haproxy:proxy' state.sls haproxy
        salt -C 'I@barbican:server and *01*' state.sls barbican.server
        salt -C 'I@barbican:server' state.sls barbican.server
        salt -C 'I@barbican:server' cmd.run "rm /etc/barbican/alias/*"
        salt -C 'I@barbican:server' service.restart apache2
        

        If the barbican.server state fails with the "Rendering SLS 'base:barbican.server' failed: Jinja variable 'dict object' has no attribute 'key'" error, which may occur, for example, due to the mine data being deleted after a call to the mine.flush function:

        1. Obtain the Dogtag certificate location:

          salt -C 'I@dogtag:server:role:master' pillar.get dogtag:server:export_pem_file_path
          

          Example of system response:

          /etc/dogtag/kra_admin_cert.pem
          
        2. Apply the following state:

          Note

          In the state below, substitute the certificate path with the one you obtained in the previous step.

          salt -C 'I@dogtag:server:role:master' mine.send dogtag_admin_cert \
          mine_function=cmd.run 'cat /etc/dogtag/kra_admin_cert.pem'
          
        3. Rerun the failed Barbican state.

Configure and use the Backupninja pipeline

This section describes the Backupninja pipelines. Before proceeding, verify that you have performed the steps described in the following sections depending on your use case:

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Configure the Backupninja backup pipeline

The Backupninja backup pipeline supports backup of the following components:

  • MAAS, including the PostgreSQL database that contains the data for MAAS and machines that are registered and managed by MAAS
  • Salt Master, including the PKI keys, certificates, the Salt Minion on cfg01, and the metadata model
  • Dogtag server files and database

To use the Backupninja backup pipeline, you must override the current Backupninja configuration and disable the default backup schedule. This section describes how to configure the Backupninja backup pipeline.

To configure the Backupninja backup pipeline:

  1. Log in to the Salt Master node.

  2. Verify that Backupninja scheduling is set to manual:

    salt-call pillar.get backupninja:client
    

    Example of system response:

    parameters:
      backupninja:
        client:
          scheduling:
            when:
              - manual
    
  3. If the pillars do not include the parameter above or it has a different value:

    1. Manually override the value by adding the following block to infra/backup/client_common.yml.

      parameters:
        backupninja:
          client:
            scheduling:
              when:
                - manual
      
    2. Apply the backupninja state:

      salt -C 'I@backupninja:client' state.sls backupninja
      
  4. Verify that the class with the Backupninja backup Jenkins job is present in the pillar:

    salt -C 'I@jenkins:client and not I@salt:master' pillar.get jenkins:client:job:backupninja_backup
    
  5. If the pillars do not include the Backupninja backup pipeline:

    1. Add the following class to cicd/control/leader.yml:

      - system.jenkins.client.job.deploy.backupninja_backup
      
    2. Rerun the jenkins state on the cid01 node:

      salt -C 'I@jenkins:client and not I@salt:master' state.sls jenkins.client
      

Configure the Backupninja restore pipeline

The Backupninja restore pipeline supports restore of the following components:

  • MAAS, including the PostgreSQL database that contains the data for MAAS and machines that are registered and managed by MAAS
  • Salt Master, including the PKI keys, certificates, the Salt Minion on cfg01, and the metadata model

To configure the Backupninja restore pipeline:

  1. Log in to the Salt Master node.

  2. Verify that the class with the Backupninja restore Jenkins job is present in the pillar:

    salt -C 'I@jenkins:client and not I@salt:master' pillar.get jenkins:client:job:backupninja_restore
    
  3. If the pillars do not include the Backupninja restore pipeline:

    1. Add the following class to cicd/control/leader.yml:

      - system.jenkins.client.job.deploy.backupninja_restore
      
    2. Rerun the jenkins state on the cid01 node:

      salt -C 'I@jenkins:client and not I@salt:master' state.sls jenkins.client
      
  4. If you plan to restore the Salt Master node and MAAS:

    1. In cluster/infra/config/init.yml, configure the pillar for salt-master and salt-minion:

      parameters:
         salt:
         master:
            initial_data:
               engine: backupninja
               source: ${_param:backupninja_backup_host}  # the backupninja server that stores Salt Master backups, for example: kvm03
               host: ${_param:infra_config_hostname}.${_param:cluster_domain}  # for example: cfg01.deploy-name.local
               home_dir: '/path/to/backups/' # for example: '/srv/volumes/backup/backupninja'
         minion:
            initial_data:
               engine: backupninja
               source: ${_param:backupninja_backup_host}  # the backupninja server that stores Salt Master backups
               host: ${_param:infra_config_hostname}.${_param:cluster_domain}  # for example: cfg01.deploy-name.local
               home_dir: '/path/to/backups/' # for example: '/srv/volumes/backup/backupninja'
      

      Note

      If the salt-master restore has already been performed on your cluster, first delete the master-restored and minion-restored files in /srv/salt.

    2. In cluster/infra/maas.yml, add the following class for MAAS restore:

      - system.maas.region.restoredb
      
  5. Since 2019.2.12 If you plan to restore the Keystone credential keys and such a restore operation has already been performed on your cluster, first delete the .keystone_restored file in /etc/salt.

Create an instant backup using Backupninja pipeline

This section describes how to create an instant backup of the MAAS PostgreSQL database, the Salt Master node, and the Dogtag server files and database using the Backupninja Jenkins pipeline job.

Note

Backup of the Dogtag server and database using the pipeline is available starting from the 2019.2.6 maintenance update only. For previous versions, use the pipeline job only to back up the Salt Master node and MAAS PostgreSQL database. To back up Dogtag on the MCP version prior to 2019.2.6, see: Back up the Dogtag server files and database prior to 2019.2.6.

Warning

The pipeline job executes all configured Backupninja jobs present on the related MAAS, Salt Master, and Dogtag hosts.

To create an instant backup using the Jenkins pipeline job:

  1. Verify that you have completed the steps described in Enable a backup schedule for the Salt Master node using Backupninja and Enable a backup schedule for a MAAS PostgreSQL database using Backupninja to back up the Salt Master node and MAAS.

  2. Verify that you have completed the steps described in Prepare for the Dogtag backup and restore.

  3. Configure the backup pipeline job as described in Configure the Backupninja backup pipeline.

  4. Log in to the Jenkins web UI.

  5. Select from the following options:

    • For MCP versions prior to 2019.2.6, open the Backupninja salt-master/MaaS backup pipeline job.
    • For MCP versions starting from 2019.2.6, open the Backupninja backup pipeline pipeline job.
  6. Specify the required parameters:

    Backupninja backup pipeline parameters
    Parameter Description and values
    SALT_MASTER_URL Define the IP address of your Salt Master node host and the salt-api port. For example, http://172.18.170.27:6969.
    SALT_MASTER_CREDENTIALS Define credentials_id as credentials for the connection.
    ASK_CONFIRMATION Select if you want the pipeline job to wait for a manual confirmation before running specific steps.
    BACKUP_SALTMASTER_AND_MAAS Added since 2019.2.6 Select to back up Salt Master and MAAS.
    BACKUP_DOGTAG Added since 2019.2.6 Select to back up the Dogtag files and database.
    BACKUP_KEYSTONE_CREDENTIAL_KEYS Added since 2019.2.12 Select to back up the Keystone credential keys.
  7. Click Build.

    Warning

    During the first backup using the Jenkins pipeline job, a manual confirmation may be required to install the required packages. After the first successful run, no manual confirmation is required.

    To disable the manual confirmation, set the ASK_CONFIRMATION parameter to False.

    The pipeline job workflow:

    1. Pillar verification. Verify that initial_data in the pillars is defined correctly to prevent any issues related to a wrong configuration during the pipeline job execution.
    2. Backup location check. Verify that the backup node is ready and all required packages are installed.
    3. Backup preparation. Verify that the backupninja states are up to date.
    4. Backup execution.
  8. Verify that the backup has been created and, in case of a remote backup storage, moved correctly.
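
    For example, to inspect the backups stored on the Backupninja server node, assuming the default backup directory /srv/volumes/backup/backupninja:

    salt -C 'I@backupninja:server' cmd.run 'ls -lh /srv/volumes/backup/backupninja'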

Restore the services using Backupninja pipeline

This section describes how to restore the MAAS PostgreSQL database, Salt Master node, and Dogtag server files and database using the Backupninja Jenkins pipeline job.

Note

Restore of the Dogtag server and database using the pipeline is available starting from the 2019.2.11 maintenance update only. For previous versions, use the Backupninja pipeline job only to restore the Salt Master node and MAAS PostgreSQL database.

To restore the services using the Jenkins pipeline job:

  1. Verify that you have completed the steps described in Configure the Backupninja restore pipeline.

  2. Log in to the Jenkins web UI.

  3. Select from the following options:

    • For MCP versions prior to 2019.2.6, open the Backupninja restore salt-master/MaaS backup pipeline job.
    • For MCP versions starting from 2019.2.6, open the Backupninja restore pipeline pipeline job.
  4. Specify the required parameters:

    Backupninja restore pipeline parameters
    Parameter Description and values
    SALT_MASTER_URL Add the IP address of your Salt Master node host and the salt-api port. For example, http://172.18.170.27:6969.
    CREDENTIALS_ID Add credentials_id as credentials for the connection.
    RESTORE_SALTMASTER_AND_MAAS Added since 2019.2.6 Select to restore Salt Master and MAAS.
    RESTORE_KEYSTONE_CREDENTIAL_KEYS Added since 2019.2.12 Select to restore the Keystone credential keys.
    RESTORE_DOGTAG Added since 2019.2.11 Select to restore the Dogtag files and database.
  5. Click Build.

    The Jenkins pipeline job workflow:

    1. Pillar verification. Verify that initial_data in the pillars is defined correctly to prevent any issues related to a wrong configuration during the pipeline job execution.

    2. Perform the restore.

      If the pipeline job fails during the Dogtag restore with the "Rendering SLS 'base:barbican.server' failed: Jinja variable 'dict object' has no attribute 'key'" error, which may occur, for example, due to the mine data being deleted after a call to the mine.flush function:

      1. Obtain the Dogtag certificate location:

        salt -C 'I@dogtag:server:role:master' pillar.get \
        dogtag:server:export_pem_file_path
        

        Example of system response:

        /etc/dogtag/kra_admin_cert.pem
        
      2. Apply the following state:

        Note

        In the state below, substitute the certificate path with the one you obtained in the previous step.

        salt -C 'I@dogtag:server:role:master' mine.send dogtag_admin_cert \
        mine_function=cmd.run 'cat /etc/dogtag/kra_admin_cert.pem'
        
      3. Rerun the failed Barbican state.

  6. Verify that the restore has completed correctly:

    1. Verify the Salt Master node:
      1. Verify the Salt keys in /etc/salt/pki/.
      2. Verify the model in /srv/salt/reclass.
      3. Try to perform a test ping on the available Salt Minions using test.ping.
    2. Verify MAAS:
      1. Verify the MAAS UI.
      2. List the available machines in MAAS through the maasng.list_machines Salt module.
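
    For example, the following commands, run from the Salt Master node, provide a quick check of the Salt Minions and the MAAS machines; the MAAS region targeting is an assumption and may differ in your deployment:

      salt '*' test.ping
      salt -C 'I@maas:region' maasng.list_machines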
  7. Verify Dogtag:

    1. Verify the server data.
    2. Verify the saved secrets in the database.

Scheduled maintenance with a planned power outage

This section provides instructions on how to correctly power off your MCP deployment partially or entirely in case of a power outage due to planned maintenance, an upgrade, or power supply testing.

Caution

If you plan on powering off a whole MCP cluster, verify that your cloud environment workloads and applications are ready for the shutdown of the compute nodes before proceeding.

Shut down an MCP OpenStack environment partially

You can power off your MCP OpenStack environment partially. Before you start powering off the required nodes, verify that at least two Galera dbs nodes are running. As long as you shut down only one Galera node and at least one other node remains running, you avoid downtime in your cloud environment.
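
For example, a quick way to check the Galera cluster size before shutting down a dbs node, assuming the standard MCP Galera targeting:

salt -C 'I@galera:master' mysql.status | grep -A1 wsrep_cluster_size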

Once the verification is performed, proceed with powering off the required nodes using the following command:

salt '<node_name>' system.poweroff

Shut down the whole MCP OpenStack environment

Before you proceed with the procedure, verify that all backups are up to date and, if possible, are stored externally.

If you have Ceph in your MCP cluster, power it off as described in Shut down a Ceph cluster.

To shut down an MCP OpenStack environment:

  1. Power off the prx nodes to discontinue all external connections:

    salt 'prx*' system.poweroff
    
  2. Power off the OpenStack bare metal bmt controller nodes:

    salt 'bmt*' system.poweroff
    
  3. Power off the ctl nodes:

    salt 'ctl*' system.poweroff
    
  4. If OpenContrail is deployed, power off the ntw and nal nodes:

    salt 'ntw*' system.poweroff
    salt 'nal*' system.poweroff
    
  5. Power off the gtw nodes:

    salt 'gtw*' system.poweroff
    
  6. Power off the msg nodes:

    salt 'msg*' system.poweroff
    
  7. Power off the dbs nodes in the following strict order:

    salt 'dbs03*' system.poweroff
    salt 'dbs02*' system.poweroff
    salt 'dbs01*' system.poweroff
    
  8. Power off all StackLight LMA nodes:

    salt 'log*' system.poweroff
    salt 'mon*' system.poweroff
    salt 'mtr*' system.poweroff
    
  9. If you have any containers running on the cid nodes, for example, Jenkins, stop them.

  10. Power off the cid nodes:

    salt 'cid*' system.poweroff
    
  11. If local Aptly is used, power off the Aptly node:

    salt 'apt01*' system.poweroff
    
  12. Power off the cmp nodes:

    salt 'cmp*' system.poweroff
    
  13. Verify that all control plane instances are down except for the cfg01 node if the latter runs on a KVM node:

    salt 'kvm*' cmd.run 'virsh list | grep -v cfg01'
    
  14. Power off the kvm nodes.

    Since the cfg01 node usually runs on a KVM node, postpone the shutdown of the KVM nodes, for example, for 15 minutes, to safely shut down the cfg01 node:

    Warning

    After running the command below, you will lose any access to the environment in 15 minutes.

    salt 'kvm*' cmd.run 'shutdown --poweroff 15' --async
    
  15. Power off the cfg01 node before the kvm nodes have powered off:

    salt 'cfg01*' system.poweroff
    

Start an MCP OpenStack environment after a power-off

This section describes how to correctly start an MCP OpenStack environment after it has been powered off.

If you have Ceph in your MCP cluster, start it as described in Start a Ceph cluster.

To start an MCP OpenStack environment:

  1. Start the kvm01 node hosting the dbs01 node.

  2. Start the kvm02 node hosting the dbs02 node.

  3. Start the kvm03 node hosting the dbs03 node.

    Note

    By default, the autostart option is enabled for the host nodes. Therefore, all VMs they host will start automatically.

    Otherwise, start the database server nodes in the following strict order:

    1. Start the dbs01 node.
    2. Start the dbs02 node.
    3. Start the dbs03 node.
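
    If autostart is disabled, a possible way to start the database VMs manually is to run virsh on the corresponding KVM hosts; the VM and domain names below are placeholders that depend on your deployment:

    virsh start dbs01.<cluster_domain>   # on kvm01
    virsh start dbs02.<cluster_domain>   # on kvm02
    virsh start dbs03.<cluster_domain>   # on kvm03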
  4. Verify that the cluster is up. In case of issues, restore the cluster as described in Restore a Galera cluster.

  5. Start the RabbitMQ server msg nodes.

  6. Verify that the cluster is up. In case of issues, proceed with Troubleshoot RabbitMQ.
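
    For example, a quick check of the RabbitMQ cluster state, assuming the standard msg node naming:

    salt 'msg*' cmd.run 'rabbitmqctl cluster_status'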

  7. Start the gtw nodes.

  8. Start the OpenStack controller ctl nodes.

  9. Start the bare metal bmt nodes.

  10. For the OpenContrail deployments, start the nal and ntw nodes.

  11. Verify that the cluster is up. In case of issues, proceed with Troubleshoot OpenContrail.

  12. Start the proxy prx nodes.

  13. Start the StackLight LMA mon, mtr, and log nodes.

  14. Start the StackLight OSS cid nodes.

Shut down a Ceph cluster

This section describes how to shut down an entire Ceph cluster. Make sure that your Ceph cluster is healthy before proceeding with the following steps.
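
For example, a possible way to check the overall cluster health from the Salt Master node, assuming the standard Ceph Monitor targeting:

salt -C 'I@ceph:mon and *01*' cmd.run 'ceph -s'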

  1. Stop all clients that write to Ceph.

  2. Shut down the RADOS Gateway nodes:

    salt 'rgw*' system.poweroff
    
  3. Set the Ceph OSD flags:

    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nodown
    ceph osd set pause
    
  4. Shut down the Ceph OSD nodes one by one:

    salt 'osd01*' system.poweroff
    salt 'osd02*' system.poweroff
    salt 'osd03*' system.poweroff
    ...
    
  5. Shut down the Ceph Monitor nodes one by one:

    salt 'cmn01*' system.poweroff
    salt 'cmn02*' system.poweroff
    salt 'cmn03*' system.poweroff
    ...
    

Start a Ceph cluster

This section describes how to correctly start a Ceph cluster after it was powered off.

  1. Power on the Ceph Monitor nodes.

  2. Power on the Ceph OSD nodes.

  3. Wait for all nodes to become available.

  4. Unset the Ceph OSD flags:

    ceph osd unset noout
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset nodown
    ceph osd unset pause
    
  5. Power on the RADOS Gateway nodes.

  6. Start all clients that write to Ceph.

Upgrade and update an MCP cluster

A typical MCP cluster includes multiple components, such as DriveTrain, StackLight, OpenStack, OpenContrail, and Ceph. Most MCP components have their own versioning schema. For the majority of the components, MCP supports multiple versions at once.

The upgrade of an MCP deployment to a new version is a multi-step process that needs to take into account the cross-dependencies between the components of the platform and the compatibility matrix of supported component versions.

The MCP components that do not have their own versioning schema within MCP and are versioned by the MCP release include:

  • The DriveTrain components: Aptly, Gerrit, Jenkins, Reclass, Salt formulas and their subcomponents
  • StackLight LMA

Caution

Before proceeding with the upgrade procedure, verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.

Note

Starting from the MCP 2019.2.16 maintenance update, before proceeding with any update or upgrade procedure, first verify that Nova cell mapping is enabled. For details, see Disable Nova cell mapping.

Note

Starting from the MCP 2019.2.17 maintenance update, before proceeding with the next update procedure, you can verify that the model contains information about the necessary fixes and workarounds. For details, see Verify DriveTrain.

For the MCP components with support for multiple versions, such as OpenStack or OpenContrail, you usually can select between two operations:

  • Minor version update (maintenance update)

    New minor versions of the components artifacts are installed. Services are restarted as necessary. This kind of update allows you to obtain the latest bug and security fixes for the components, but it typically does not change the components capabilities.

  • Major version update (upgrade)

    New major versions of the components artifacts are installed. Additional orchestration tasks are executed to change the components configuration, if necessary. This kind of update typically changes and improves the components capabilities.

The following table outlines a general upgrade and update procedure workflow of an MCP cluster. For the detailed upgrade and update workflow of MCP components, refer to the corresponding sections below.

General upgrade and update procedure workflow
# Stage Description
1 Upgrade or update DriveTrain

Perform the basic LCM update or upgrade:

  1. Update the Reclass system.
  2. Fetch the corresponding Git repositories.
  3. Update all binary repository definitions on the Salt Master node.
  4. Update and sync all Salt formulas.
  5. Apply the linux.repo, linux.user, and openssh states on all nodes.
  6. Upgrade or update the DriveTrain services.
  7. Optional. Upgrade system packages on the Salt Master node.
  8. Upgrade or update GlusterFS:
    1. Upgrade or update packages for the GlusterFS server on each target host one by one.
    2. Upgrade or update packages for the GlusterFS clients and re-mount volumes on each target GlusterFS client host one by one.
    3. Obtain the cluster.max-op-version option value from GlusterFS and compare it with cluster.op-version to identify whether a version upgrade is required.
    4. Update cluster.op-version.
  9. Optional. Configure allowed and rejected IP addresses for the GlusterFS volumes.
2 Upgrade or update OpenContrail (if applicable)
  1. Verify the OpenContrail service statuses.
  2. Back up the Cassandra and ZooKeeper data.
  3. Stop the Neutron server services.
  4. Upgrade or update the OpenContrail analytics nodes simultaneously. During upgrade, new Docker containers for the OpenContrail analytics nodes are spawned. During update, the corresponding Docker images are updated.
  5. Upgrade or update the OpenContrail controller nodes. During upgrade, new Docker containers for the OpenContrail controller nodes are spawned. During update, the corresponding Docker images are updated. All nodes are upgraded or updated simultaneously except the node that currently runs the contrail-control service, which is upgraded or updated after the other nodes.
  6. Upgrade or update the OpenContrail packages on the OpenStack controller nodes simultaneously.
  7. Start the Neutron server services.
  8. Upgrade or update the OpenContrail data plane nodes one by one, migrating the workloads if needed, since this step implies downtime of the Networking service.
3 Upgrade or update OpenStack or Kubernetes

For OpenStack:

  1. On every OpenStack controller node one by one:
    1. Stop the OpenStack API services.
    2. Upgrade or update the OpenStack packages.
    3. Start the OpenStack services.
    4. Apply the OpenStack states.
    5. Verify that the OpenStack services are up and healthy.
  2. Upgrade the OpenStack data plane.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

4 Upgrade or update Galera
  1. Prepare the Galera cluster for the upgrade.
  2. Upgrade or update the MySQL and Galera packages on the Galera nodes one by one.
  3. Verify the cluster status after upgrade.
5 Upgrade or update RabbitMQ
  1. Prepare the Neutron service for the RabbitMQ upgrade or update.
  2. Verify that the RabbitMQ upgrade pipeline job is present in Jenkins.
  3. Upgrade or update the RabbitMQ component.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

   

For Kubernetes:

  1. Upgrade or update essential Kubernetes binaries, for example, hyperkube, etcd, cni.
  2. Restart essential Kubernetes services.
  3. Upgrade or update the addons definitions with the latest images.
  4. Perform the Kubernetes control plane changes, if any, on every Kubernetes Master node one by one.
  5. Upgrade or update the Kubernetes Nodes one by one.
6 Upgrade or update StackLight
  1. During upgrade, enable the Ceph Prometheus plugin (if applicable).
  2. Upgrade or update system components including Telegraf, Fluentd, Prometheus Relay, libvirt-exporter, and jmx-exporter.
  3. Upgrade or update Elasticsearch and Kibana one by one:
    1. Stop the corresponding service on all log nodes.
    2. Upgrade or update the packages to the newest version.
    3. For Elasticsearch, reload the systemd configuration.
    4. Start the corresponding service on all log nodes.
    5. Verify that the Elasticsearch cluster status is green.
    6. In case of a major version upgrade, transform the indices for the new version of Elasticsearch and migrate Kibana to the new index.
  4. Upgrade or update components running in Docker Swarm:
    1. Disable and remove the previous versions of monitoring services.
    2. Rebuild the Prometheus configuration by applying the prometheus state on the mon nodes.
    3. Disable and remove the previous version of Grafana.
    4. Start the monitoring services by applying the docker state on the mon nodes.
    5. Apply the saltutil.sync_all state and the grafana.client state to refresh the Grafana dashboards.
7 Upgrade or update Ceph

For upgrade:

  1. Prepare the Ceph cluster for upgrade.
  2. Perform the backup.
  3. Upgrade the Ceph repository on each node one by one.
  4. Upgrade the Ceph packages on each node one by one.
  5. Restart the Ceph services on each node one by one.
  6. Verify the upgrade on each node one by one and wait for user input to proceed.
  7. Perform the post-upgrade procedures.

For update:

  1. Update and install new Ceph packages on the cmn nodes.
  2. Restart Ceph Monitor services on all cmn nodes one by one.
  3. Starting from the 2019.2.8 update, restart Ceph Manager on all mgr nodes one by one.
  4. Update and install new Ceph packages on the osd nodes.
  5. Restart Ceph OSDs services on all osd nodes one by one.
  6. Update and install new Ceph packages on the rgw nodes.
  7. Restart Ceph RADOS Gateway services on all rgw nodes one by one.

After the restart of every service, wait for the system to become healthy.

8 Update the base operating system

Install security updates on all nodes.

To reduce the size of new packages to be installed on a cluster during update or upgrade, this is the final step of the procedure. However, you can perform it at any stage to fetch only security patches.

9 Apply issue resolutions requiring manual application as described in the Addressed issues sections of all maintenance updates. Apply the fixes that require manual application for all maintenance updates one by one.

Upgrade an MCP cluster

Caution

Before proceeding with the upgrade procedure, verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.

Use the procedures in this section to upgrade your MCP cluster. More specifically, major upgrades enable:

  • Upgrading the MCP release version to the latest supported Build ID
  • Delivering new features of MCP Control plane
  • Updating the LCM platform
  • Upgrading host and guest operating systems to new versions including kernels
  • Upgrading between major OpenStack releases
  • Upgrading a Kubernetes cluster including Calico and etcd
  • Upgrading the OpenContrail nodes from version 3.2 to 4.1
  • Upgrading StackLight LMA
  • Upgrading Ceph

Warning

Before upgrading your MCP cluster, verify the compatibility of different components versions in a specific MCP release. For details, see MCP compatibility matrix.

This section describes the upgrade procedures of the MCP components.

Upgrade DriveTrain to a newer release version

You can upgrade your MCP deployment to a certain MCP release version through DriveTrain using the Deploy - upgrade MCP DriveTrain pipeline. An MCP release version is a stable and product-ready combination of versions of MCP components tagged with a specific Build ID.

This section describes how to upgrade your MCP version from the Build ID 2018.11.0 to 2019.2.0. To upgrade an MCP version with an earlier Build ID, refer to the upgrade paths table in Release Compatibility Matrix: Supported upgrade paths.

Use this procedure to apply minor updates to MCP DriveTrain that will allow for further updates of the MCP components packages to higher minor versions as described in Update an MCP cluster. For a major upgrade of release versions of OpenStack, Kubernetes, and other MCP components, refer to the corresponding sections of Upgrade an MCP cluster.

Caution

The OpenContrail 4.x packages update is covered in a separate procedure. For details, see: Update the OpenContrail 4.x nodes.

To update the OpenContrail 3.2 packages, refer to the Upgrade the OpenContrail nodes to version 3.2 procedure after you upgrade MCP to a newer release version.

Prerequisites

Before you proceed with the upgrade procedure, complete the following steps:

  1. Back up the Salt Master node as described in Back up and restore the Salt Master node.

  2. If Jenkins is enabled on the Salt Master node, add the following parameters to ./classes/cluster/<cluster_name>/infra/config/jenkins.yml of your Reclass model:

    • To the _param section:

      jenkins_master_protocol: http
      
    • To the jenkins:client:master section:

      proto: http
      
  3. Log in to the Jenkins web UI.

  4. Run the following pipeline jobs with BRANCHES set to *:

    • git-mirror-downstream-mk-pipelines
    • git-mirror-downstream-pipeline-library

    Caution

    Before you proceed to Upgrade to MCP release version 2019.2.0, verify that these jobs succeed.

Upgrade to MCP release version 2019.2.0

This section describes how to upgrade the MCP release version on your deployment from the Build ID 2018.11.0 to 2019.2.0.

To upgrade to MCP release version 2019.2.0:

  1. Verify that you have completed the steps described in Prerequisites.

  2. Log in to the Salt Master node.

  3. From the /srv/salt/reclass/classes/cluster directory of your Reclass model, verify that the correct Salt formulas and OpenContrail repositories are enabled in your deployment:

    Note

    Starting from the 2019.2.0 MCP version, the Salt formulas and OpenContrail repositories are moved from http://apt.mirantis.com to http://mirror.mirantis.com.

    grep -r --exclude-dir=aptly -l 'system.linux.system.repo.mcp.salt'
    

    If matches are found, replace the classes accordingly on the cluster Reclass level (see the example command after this list):

    • Replace system.linux.system.repo.mcp.salt with system.linux.system.repo.mcp.apt_mirantis.salt-formulas
    • Replace system.linux.system.repo.mcp.updates with system.linux.system.repo.mcp.apt_mirantis.update
    • Replace system.linux.system.repo.mcp.contrail with system.linux.system.repo.mcp.apt_mirantis.contrail
    • Replace system.linux.system.repo.mcp.extra with system.linux.system.repo.mcp.apt_mirantis.extra
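
    A hypothetical helper for one of these replacements, run from the /srv/salt/reclass/classes/cluster directory; repeat for each class and review the result with git diff before committing. The end-of-line anchor in the sed expression prevents touching longer class names:

    grep -rl --exclude-dir=aptly 'system.linux.system.repo.mcp.salt' . | \
      xargs sed -i 's/system\.linux\.system\.repo\.mcp\.salt$/system.linux.system.repo.mcp.apt_mirantis.salt-formulas/'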
  4. Depending on your cluster configuration, add the update repositories for the required components. The list of the update repositories includes cassandra, ceph, contrail, docker, elastic, extra, kubernetes_extra, openstack, percona, salt-formulas, saltstack, and ubuntu.

    For example, to add the update repository for OpenStack:

    1. Change the directory to /srv/salt/reclass/classes/cluster.

    2. Verify whether the OpenStack component is present in the model:

      grep -r --exclude-dir=aptly -l 'system.linux.system.repo.mcp.apt_mirantis.openstack'
      
    3. If matches are found, include the update repository in your Reclass model by editing the files that include these matches:

      classes:
      - system.linux.system.repo.mcp.apt_mirantis.openstack
      - system.linux.system.repo.mcp.apt_mirantis.update.openstack
      
  5. Open the project Git repository with your Reclass model on the cluster level.

  6. In /infra/backup/client_mysql.yml, verify that the following parameters are defined:

    parameters:
      xtrabackup:
        client:
          cron: false
    
  7. In /infra/backup/server.yml, verify that the following parameters are defined:

    parameters:
      xtrabackup:
        server:
          cron: false
      # if ceph is enabled
      ceph:
        backup:
          cron: false
    
  8. If any physical node in your cluster has LVM physical volumes configured, for example, the root partition, define these volumes in your Reclass model.

    You can verify where LVM is configured using the pvdisplay or lvm pvs command. For example, run the following command from the Salt Master node:

    salt '*' cmd.run 'lvm pvs'
    

    For example, if one of your physical nodes has /dev/sda1 for a physical volume of a volume group vgroot and a logical volume lvroot (/dev/vgroot/lvroot) mounted as /, add the following pillar data for this node:

    parameters:
      linux:
        storage:
          lvm:
            vgroot:
              enabled: true
              devices: /dev/sda1
    

    For example, if all your compute nodes have an LVM physical volume configured, add the above pillar to /openstack/compute/init.yml.

    Warning

    You must add the above pillar data for all nodes with all LVM volume groups configured. Otherwise, the LVM configuration will be updated improperly during the upgrade and a node will be unable to boot from the logical volume.

  9. If OpenContrail 3.2 is used, verify that the following configurations are present in your Reclass model:

    • In the /infra/backup/client_zookeeper.yml and /infra/backup/server.yml files:

      parameters:
        zookeeper:
          backup:
            cron: false
      
    • In the /infra/backup/client_cassandra.yml and /infra/backup/server.yml files:

      parameters:
        cassandra:
          backup:
            cron: false
      

    Caution

    The OpenContrail 4.x update is covered in a separate procedure. For details, see: Update the OpenContrail 4.x nodes.

  10. If OpenStack Telemetry is used, switch Redis to use password authentication:

    Warning

    During this procedure, a short Tenant Telemetry downtime occurs.

    1. In /infra/secrets.yml, add a password for Redis:

      parameters:
        _param:
          openstack_telemetry_redis_password_generated: <very_strong_password>
      
      • The password can contain uppercase and lowercase letters of the Latin alphabet (A-Z, a-z) and digits (0-9).
      • The recommended password length is 32 characters.

      Warning

      Since the key feature of Redis is high performance, an attacker can try many passwords per second. Therefore, create a very strong password to prevent information leak.
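
      For example, a possible way to generate a random 32-character password that matches these constraints:

      < /dev/urandom tr -dc 'A-Za-z0-9' | head -c 32; echo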

    2. In /openstack/init.yml, add the following parameter:

      parameters:
        _param:
          openstack_telemetry_redis_password: ${_param:openstack_telemetry_redis_password_generated}
      
    3. In /openstack/telemetry.yml, update the following definitions:

      • Update the openstack_telemetry_redis_url parameter value. For example:

        parameters:
          _param:
            openstack_telemetry_redis_url: redis://openstack:${_param:openstack_telemetry_redis_password}@${_param:redis_sentinel_node01_address}:26379?sentinel=master_1&sentinel_fallback=${_param:redis_sentinel_node02_address}:26379&sentinel_fallback=${_param:redis_sentinel_node03_address}:26379
        
      • Add the password parameter to the following section:

        redis:
          cluster:
            ...
            password: ${_param:openstack_telemetry_redis_password}
            ...
        
    4. Refresh pillars:

      salt 'mdb*' saltutil.pillar_refresh
      
    5. Apply the changes:

      • For the Redis cluster:

        Warning

        After applying the Redis states, the Tenant Telemetry services will not be able to connect to Redis.

        salt -C 'I@redis:cluster:role:master' state.sls redis
        salt -C 'I@redis:server' state.sls redis
        
      • For the Tenant Telemetry components:

        salt -C 'I@gnocchi:server' state.sls gnocchi.server
        salt -C 'I@ceilometer:server' state.sls ceilometer.server
        salt -C 'I@aodh:server' state.sls aodh.server
        

      Note

      Once you apply the Salt states above, Tenant Telemetry will become fully operational again.

  11. In /cicd/control/leader.yml, verify that the following class is present:

    classes:
    - system.jenkins.client.job.deploy.update
    
  12. Commit the changes to your local repository.

  13. Log in to the Jenkins web UI.

  14. Verify that you do not have any unapproved scripts in Jenkins:

    1. Navigate to Manage Jenkins > In-process script approval.
    2. Approve pending scripts if any.
  15. Run the Deploy - upgrade MCP DriveTrain pipeline in the Jenkins web UI specifying the following parameters as required:

    Deploy - upgrade MCP DriveTrain pipeline parameters
    Parameter Description
    SALT_MASTER_URL Salt Master API URL string
    SALT_MASTER_CREDENTIALS Jenkins credentials to access the Salt Master API
    BATCH_SIZE Added since 2019.2.6 update The batch size for Salt commands targeted for a large amount of nodes. Set to an absolute number of nodes (integer) or percentage, for example, 20 or 20%. For details, see MCP Deployment Guide: Configure Salt Master threads and batching.
    OS_DIST_UPGRADE For updates starting from 2019.2.2 to 2019.2.8 or later

    Applicable only for maintenance updates starting from 2019.2.2 or later to 2019.2.8 and later. Set to true to upgrade the system packages including kernel using apt-get dist-upgrade on the cid* nodes.

    Note

    Prior to the maintenance update 2019.2.8, manually specify the parameter in the DRIVE_TRAIN_PARAMS field. For later versions, the parameter is predefined and set to false by default.

    OS_UPGRADE For updates starting from 2019.2.2 to 2019.2.8 or later

    Applicable only for maintenance updates starting from 2019.2.2 or later to 2019.2.8 and later. Set to true to upgrade all installed applications using apt-get upgrade on the cid* nodes.

    Note

    Prior to the maintenance update 2019.2.8, manually specify the parameter in the DRIVE_TRAIN_PARAMS field. For later versions, the parameter is predefined and set to false by default.

    APPLY_MODEL_WORKAROUNDS For maintenance updates to 2019.2.8 or later Applicable only for maintenance updates to 2019.2.8 and later. Recommended. Select to apply the cluster model workarounds automatically unless you manually added some patches to the model before the update.
    UPGRADE_SALTSTACK Upgrade the SaltStack packages (salt-master, salt-api, salt-common) on the Salt Master node and the salt-minion package on all nodes.
    UPDATE_CLUSTER_MODEL

    Automatically apply the cluster model changes required for the target MCP version:

    1. Replace the mcp_version parameter with TARGET_MCP_VERSION on the cluster level of Reclass model
    2. Apply other backward incompatible cluster model updates
    UPDATE_PIPELINES Automatically update the pipeline-library and mk-pipelines repositories from upstream/local mirror into Gerrit
    UPDATE_LOCAL_REPOS Deprecated since MCP 2019.2.0 Update local repositories on the Aptly node if applicable.
    GIT_REFSPEC Git version of the Reclass system to use (branch or tag). Must match TARGET_MCP_VERSION.
    MK_PIPELINES_REFSPEC Git version of mk/mk-pipelines to use (branch or tag). Must match TARGET_MCP_VERSION.
    PIPELINE_TIMEOUT The time for the Jenkins job to complete, set to 12 hours by default. If the time is exceeded, the Jenkins job will be aborted.
    TARGET_MCP_VERSION

    Target version of MCP that will correspond to the mcp_version parameter value in your Reclass model. Select from the following options:

    • If you upgrade DriveTrain to the latest major version, set the corresponding latest available Build ID.
    • To apply maintenance updates to a major MCP release version, select from the following options:
      • Specify your current MCP version. For example, if your current MCP version is 2019.2.0, set 2019.2.0.
      • Starting from 2019.2.8, alternatively, set the TARGET_MCP_VERSION, MK_PIPELINES_REFSPEC, and GIT_REFSPEC parameters to the target maintenance update version. For example, to update from 2019.2.7 to 2019.2.8, specify 2019.2.8.

    The pipeline workflow:

    1. All required updates and fixes for upgrade are applied.
    2. Packages for Salt formulas and Reclass are updated, cluster model is updated.
    3. The following DriveTrain components are updated:
      • Local repositories if needed
      • Salt Master and minions
      • Jenkins
      • Docker
  16. Optional. To upgrade system packages on the Salt Master node, select from the following options:

    • If you are upgrading DriveTrain to the latest major version, run the following commands one by one from the cfg01 node:

      apt-get update
      apt-get dist-upgrade
      
    • If you are applying maintenance updates to a major MCP version, run the following commands one by one from the cfg01 node:

      apt-get update
      apt-get upgrade
      
  17. Proceed to Upgrade GlusterFS.

Upgrade GlusterFS

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section describes how to upgrade GlusterFS on your deployment from version 3.8 to 5.5 using three dedicated pipeline jobs.

If no services other than the Docker Swarm services such as Jenkins, Gerrit, and LDAP run on top of the GlusterFS volumes, you can use the all-in-one Update GlusterFS Jenkins pipeline job that sequentially executes the three dedicated pipeline jobs described below with the default parameters.

Otherwise, if any services other than the Docker Swarm services run on top of the GlusterFS volumes, Mirantis recommends updating the GlusterFS components separately using the granular pipeline jobs to better control the upgrade process.

Note

This procedure does not include the backup of data located on the GlusterFS volumes.

To upgrade GlusterFS to version 5.5:

  1. Log in to the Salt Master node.

  2. In infra/init.yml of your Reclass model, add the following parameter:

    parameters:
      _param:
        linux_system_repo_mcp_glusterfs_version_number: 5
    
  3. Apply the following state:

    salt '*' state.apply linux.system.repo
    
  4. Log in to your MCP cluster Jenkins web UI.

  5. Verify that you do not have any unapproved scripts in Jenkins:

    1. Navigate to Manage Jenkins > In-process script approval.
    2. Approve pending scripts if any.
  6. Upgrade GlusterFS on servers by running the Update glusterfs servers pipeline job with the following parameters as required:

    Update glusterfs servers pipeline job parameters
    Parameter Description and values
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    TARGET_SERVERS Salt compound target to match the nodes to be upgraded. For example, G@osfamily:debian or *. Defaults to I@glusterfs:server.
    IGNORE_SERVER_STATUS Not recommended. Select not to validate the GlusterFS server availability before the upgrade. If some servers are unavailable, then data on a volume may be unavailable during the upgrade.
    IGNORE_NON_REPLICATED_VOLUMES Not recommended. Select to upgrade GlusterFS even with a non-replicated volume. You may lose data on the non-replicated volumes during the upgrade. Therefore, if you have such volumes, Mirantis recommends stopping them before the upgrade.

    The pipeline job workflow:

    1. Select only the TARGET_SERVERS hosts on which the new GlusterFS packages are available and not installed.
    2. Verify that all TARGET_SERVERS are available and connected to the cluster. This step is skipped if the IGNORE_SERVER_STATUS parameter is set to true.
    3. Verify that all GlusterFS volumes are replicated. This step is skipped if the IGNORE_NON_REPLICATED_VOLUMES parameter is set to true.
    4. Upgrade the GlusterFS packages on each target host one by one.

  7. Upgrade GlusterFS on clients by running the Update glusterfs clients pipeline job with the following parameters as required:

    Caution

    Except for the Docker Swarm services such as Jenkins, Gerrit, and LDAP, the pipeline job ignores any other services that use the GlusterFS volumes. Therefore, Mirantis recommends stopping any non-Docker Swarm services that use the GlusterFS volumes before running the pipeline job and upgrading the clients one by one by targeting the global name of a client using the TARGET_SERVERS parameter.

    Note

    During the upgrade of the GlusterFS clients, all infrastructure services such as Jenkins, Gerrit, LDAP are unavailable for several minutes.

    Update glusterfs clients pipeline job parameters
    Parameter Description and values
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    TARGET_SERVERS Salt compound target to match the nodes to be upgraded. For example, G@osfamily:debian or *. Defaults to I@glusterfs:client.
    IGNORE_SERVER_STATUS Not recommended. Select not to validate the GlusterFS server availability before the upgrade. If some servers are unavailable, then data on volumes may be unavailable during the upgrade. Mirantis recommends verifying that the cluster is fully functional and healthy before proceeding with the upgrade.
    IGNORE_SERVER_VERSION Not recommended. Select not to validate that all GlusterFS servers are upgraded to the same version. Mirantis highly recommends upgrading all GlusterFS servers before you proceed to upgrading the GlusterFS clients.

    The pipeline job workflow:

    1. Select only the TARGET_SERVERS hosts on which the new GlusterFS packages are available and not installed.
    2. Verify that all GlusterFS servers are available and connected to the cluster. This step is skipped if the IGNORE_SERVER_STATUS parameter is set to true.
    3. Verify that all GlusterFS servers have the same package version that will be installed on the GlusterFS clients. This step is skipped if the IGNORE_SERVER_VERSION parameter is set to true.
    4. Upgrade the GlusterFS package and re-mount the volumes on each host one by one.

  8. Upgrade the GlusterFS cluster.op-version option to verify that all GlusterFS servers and clients use the updated protocol. Run the Update glusterfs cluster.op-version pipeline job with the following parameters as required:

    Update glusterfs cluster.op-version pipeline job parameters
    Parameter Description and values
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    CLUSTER_OP_VERSION Leave empty to use the cluster.max-op-version option value defined in GlusterFS. Otherwise, specify the value for the GlusterFS cluster.op-version option to use.
    IGNORE_SERVER_VERSION Not recommended. Select not to validate that all GlusterFS servers are upgraded.
    IGNORE_CLIENT_VERSION Not recommended. Select not to validate that all GlusterFS clients are upgraded.

    The pipeline job workflow:

    1. If CLUSTER_OP_VERSION is empty, obtain the cluster.max-op-version option value from GlusterFS.
    2. Obtain the current cluster.op-version and compare it with the CLUSTER_OP_VERSION parameter value to identify whether a version update is required.
    3. Verify that all GlusterFS servers are upgraded to the version equal or above the one defined in CLUSTER_OP_VERSION. This step is skipped if the IGNORE_SERVER_VERSION parameter is set to true.
    4. Verify that all GlusterFS clients are upgraded to the version equal or above the one defined in CLUSTER_OP_VERSION. This step is skipped if the IGNORE_CLIENT_VERSION parameter is set to true.
    5. Update cluster.op-version to the value defined in CLUSTER_OP_VERSION.

  9. Optional. Recommended. Configure allowed and rejected IP addresses for the GlusterFS volumes.
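
    One possible way is to use the native GlusterFS volume options directly; the volume name and address lists below are placeholders, and your Reclass model may provide its own mechanism for setting these options:

    gluster volume set <volume_name> auth.allow <comma_separated_allowed_addresses>
    gluster volume set <volume_name> auth.reject <comma_separated_rejected_addresses>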

See also

Update GlusterFS

Roll back MCP to a previous release version

You can roll back your MCP deployment to a previous MCP release version through DriveTrain using the Deploy - update cloud pipeline.

To roll back to the previous stable MCP release version:

  1. Verify that the correct previous Build ID release repositories are available. These include the local repositories present in the mirror image backup as well as the local aptly repositories.

  2. In the infra/init.yml file of the Reclass model, specify the previous Build ID in the mcp_version parameter.

  3. In the infra/init.yml file of the Reclass model, verify that the following pillar is present.

    parameters:
      linux:
        system:
          purge_repos: true
    
  4. Update the classes/system Git submodule of the Reclass model to the commit of the required Build ID by running the following command from the classes/system directory:

    git pull origin release/BUILD_ID
    
  5. Commit and push the changes to the Git repository where the cluster Reclass model is located.

  6. Select from the following options:

    • The ROLLBACK_BY_REDEPLOY was not selected for the update pipeline:

      1. Roll back the Salt Master node:

        1. On the KVM node hosting the Salt Master node, run the following commands:

          virsh destroy cfg01.domain
          virsh define /var/lib/libvirt/images/cfg01.domain.xml
          virsh start cfg01.domain
          virsh snapshot-delete cfg01.domain --metadata ${SNAPSHOT_NAME}
          rm /var/lib/libvirt/images/cfg01.domain.${SNAPSHOT_NAME}.qcow2
          rm /var/lib/libvirt/images/cfg01.domain.xml
          
        2. On the Salt Master node, apply the linux.system.repo Salt state.

      2. Roll back the CI/CD nodes:

        1. On the KVM nodes hosting the CI/CD nodes, run the following commands:

          virsh destroy cid0X.domain
          virsh define /var/lib/libvirt/images/cid0X.domain.xml
          virsh start cid0X.domain
          virsh snapshot-delete cid0X.domain --metadata ${SNAPSHOT_NAME}
          rm /var/lib/libvirt/images/cid0X.domain.${SNAPSHOT_NAME}.qcow2
          rm /var/lib/libvirt/images/cid0X.domain.xml
          
        2. On all CI/CD nodes, restart the docker service and apply the linux.system.repo Salt state.

    • The ROLLBACK_BY_REDEPLOY parameter was selected for the update pipeline:

      1. Roll back the Salt Master node as described in Back up and restore the Salt Master node.

      2. Redeploy the CI/CD nodes:

        1. Destroy and undefine the cid nodes using virsh:

          virsh destroy cid<NUM>.<DOMAIN_NAME>
          virsh undefine cid<NUM>.<DOMAIN_NAME>
          
        2. Remove the cid salt-keys from the Salt Master node.
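
          For example, assuming the default cid node naming:

          salt-key -d 'cid*' -y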

        3. Redeploy the CI/CD nodes using the Deploy CI/CD procedure.

  7. Run the Deploy - update cloud pipeline in the Jenkins web UI specifying the following parameters as required:

    Deploy - update cloud pipeline parameters
    Parameter Description
    CEPH_OSD_TARGET The Salt targeted physical Ceph OSD osd nodes.
    CID_TARGET The Salt targeted CI/CD cid nodes.
    CMN_TARGET The Salt targeted Ceph monitor cmn nodes.
    CMP_TARGET The Salt targeted physical compute cmp nodes.
    CTL_TARGET The Salt targeted controller ctl nodes.
    DBS_TARGET The Salt targeted database dbs nodes.
    GTW_TARGET The Salt targeted physical or virtual gateway gtw nodes.
    INTERACTIVE Ask interactive questions during the pipeline run. If not selected, the pipeline will either succeed or fail.
    KVM_TARGET The Salt targeted physical KVM kvm nodes.
    LOG_TARGET The Salt targeted log storage and visualization log nodes.
    MON_TARGET The Salt targeted StackLight LMA monitoring node mon nodes.
    MSG_TARGET The Salt targeted RabbitMQ server msg nodes.
    MTR_TARGET The Salt targeted StackLight LMA metering mtr nodes.
    NAL_TARGET The Salt targeted OpenContrail 3.2 analytics nal nodes.
    NTW_TARGET The Salt targeted OpenContrail 3.2 controller ntw nodes.
    PER_NODE Target nodes will be managed one by one. Recommended.
    PRX_TARGET The Salt targeted proxy prx nodes.
    RGW_TARGET The Salt targeted RADOS gateway rgw nodes.
    ROLLBACK_BY_REDEPLOY Select if live snapshots were taken during update.
    SALT_MASTER_URL URL of Salt Master node API.
    SALT_MASTER_CREDENTIALS ID of the Salt Master node API credentials stored in Jenkins.
    SNAPSHOT_NAME Live snapshot name.
    STOP_SERVICES Stop API services before the rollback.
    PURGE_PKGS The space-separated list of pkgs=versions to be purged on the physical targeted machines. For example, pkg_name1=pkg_version1 pkg_name2=pkg_version2.
    REMOVE_PKGS The space-separated list of pkgs=versions to be removed on the physical targeted machines. For example, pkg_name1=pkg_version1 pkg_name2=pkg_version2.
    RESTORE_CONTRAIL_DB

    Restore the Cassandra and ZooKeeper databases for OpenContrail 3.2. OpenContrail 4.x is not supported. Select only if rollback of the OpenContrail 3.2 controller nodes failed or a specific backup defined in the cluster model is required.

    If RESTORE_CONTRAIL_DB is selected, add the following configuration to the cluster/opencontrail/control.yml file of your Reclass model:

    • For ZooKeeper:

      parameters:
        zookeeper:
          backup:
            client:
              enabled: true
              restore_latest: 1
              restore_from: remote
      
    • For Cassandra:

      parameters:
        cassandra:
          backup:
            client:
              enabled: true
              restore_latest: 1
              restore_from: remote
      
    RESTORE_GALERA

    Restore the Galera database. Select only if the rollback of the database nodes failed or a specific backup defined in the cluster model is required.

    If RESTORE_GALERA is selected, add the xtrabackup restore lines to the cluster/openstack/database/init.yml of your Reclass model:

    parameters:
      xtrabackup:
        client:
          enabled: true
          restore_full_latest: 1
          restore_from: remote
    
    ROLLBACK_PKG_VERSIONS

    Copy back the list of package versions that were installed before the update, as acquired from the pipeline output before the packages were upgraded. The space-separated list of pkgs=versions to roll back to on the targeted physical machines. For example, pkg_name1=pkg_version1 pkg_name2=pkg_version2.

    If ROLLBACK_PKG_VERSIONS is empty, apt --allow-downgrades dist-upgrade will be run on the targeted physical machines.

    If ROLLBACK_PKG_VERSIONS contains the salt-minion package, you will have to rerun the pipeline for every targeted physical machine as it will be disconnected.

    TARGET_ROLLBACKS The comma-separated list of nodes to rollback. The valid values include ctl, prx, msg, dbs, log, mon, mtr, ntw, nal, gtw-virtual, cmn, rgw, cmp, kvm, osd, gtw-physical.
    TARGET_REBOOT

    The comma-separated list of physical nodes to reboot after a rollback. The valid values include cmp, kvm, osd, gtw-physical.

    Caution

    When the kvm node is defined, the pipeline can be interrupted due to the Jenkins slave reboot. If so, remove the already rolled back nodes from TARGET_ROLLBACKS and rerun the pipeline.

    TARGET_HIGHSTATE The comma-separated list of physical nodes to run Salt highstate on after a rollback. The valid values include cmp, kvm, osd, gtw-physical.

    Common rollback workflow for different nodes types:

    1. Downtime for VCP occurs if a rollback for VCP VMs is required.
    2. If the ROLLBACK_BY_REDEPLOY parameter is selected, the VCP VMs will be destroyed, undefined, and their salt-key will be deleted from the Salt Master node.
    3. Verification of the service or API status is done.

    The procedure of the pipeline if ROLLBACK_BY_REDEPLOY is not selected:

    1. VCP VMs by their target type are destroyed.
    2. Live snapshot is deleted if any and its original base file is used to boot the VM.
    3. Repositories are updated.
    4. Physical machines are rolled back:
      1. Repositories are updated.
      2. Packages defined in PURGE_PKGS are purged.
      3. Packages defined in REMOVE_PKGS are removed.
      4. Package versions defined in ROLLBACK_PKG_VERSIONS are installed.
      5. The Salt highstate is applied.
  8. Verify that the following lines are not present in cluster/infra/backup/client_mysql.yml:

    parameters:
      xtrabackup:
        client:
          cron: false
    
  9. Verify that the following lines are not present in cluster/infra/backup/server.yml:

    parameters:
      xtrabackup:
        server:
          cron: false
    
  10. If OpenContrail 3.2 is used:

    1. Verify that the following lines are not present in cluster/infra/backup/client_zookeeper.yml and cluster/infra/backup/server.yml:

      parameters:
        zookeeper:
          backup:
            cron: false
      
    2. Verify that the following lines are not present in cluster/infra/backup/client_cassandra.yml and cluster/infra/backup/server.yml:

      parameters:
        cassandra:
          backup:
            cron: false
      
  11. If the ROLLBACK_BY_REDEPLOY parameter was selected and the Deploy - cloud update pipeline succeeds, continue the rollback:

    1. If necessary, roll back the Salt Master node as described in the Back up and restore the Salt Master node procedure.

    2. If Ceph is enabled and you want to roll back the Ceph Monitor nodes, proceed with Restore a Ceph Monitor node.

    3. Redeploy the nodes based on your model by running the Deploy - OpenStack pipeline. See the pipeline configuration details in Deploy an OpenStack environment, step 10.

      Note

      Specify k8s in the Install parameter field in case of an MCP Kubernetes deployment.

    4. Rerun the Deploy - cloud update pipeline with RESTORE_GALERA and RESTORE_CONTRAIL_DB selected.

Upgrade the OpenContrail nodes from version 3.2 to 4.1

Caution

Before proceeding with the upgrade procedure, verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.

This section describes how to upgrade the OpenContrail nodes of an Ocata- or Pike-based MCP cluster to version 4.1 using the Deploy - upgrade Opencontrail to 4.x pipeline. To update OpenContrail from version 4.0 to 4.1, refer to Update the OpenContrail 4.x nodes.

Note

Upgrade the OpenContrail cluster before the OpenStack upgrade.

Warning

The OpenContrail data plane traffic will be affected during the upgrade procedure since restarting a vRouter agent interrupts connectivity. Plan the migration of critical workloads accordingly before the vRouter agent restart.

The OpenContrail upgrade to version 4.1 is available only from version 3.2. Therefore, if you have an older OpenContrail version, for example, 3.2.3 or 3.1.1, first update the OpenContrail packages on the OpenContrail nodes to the latest supported 3.2.x version using the Update the OpenContrail 3.2 package procedure.

The high-level workflow of the OpenContrail upgrade pipeline job is as follows:

  1. Verify the OpenContrail services statuses.
  2. Stop the Neutron API, which will be unavailable during the whole upgrade procedure.
  3. Back up the Cassandra and ZooKeeper data.
  4. Stop all running OpenContrail analytics services.
  5. Upgrade or update the OpenContrail analytics nodes simultaneously. During upgrade, new Docker containers for the OpenContrail analytics nodes are spawned. During update, the corresponding Docker images are updated.
  6. Upgrade or update the OpenContrail controller nodes. During upgrade, new Docker containers for the OpenContrail controller nodes are spawned. During update, the corresponding Docker images are updated. All nodes are upgraded or updated simultaneously except the one that is currently running the contrail-control service, which is upgraded or updated after the other nodes.
  7. Upgrade the OpenContrail packages on the OpenStack controller nodes simultaneously.
  8. Start the Neutron server services.
  9. Upgrade the OpenContrail data plane nodes one by one, migrating the workloads beforehand if needed, since this step implies downtime of the Networking service.
  10. Verify the OpenContrail services statuses.
Prerequisites

Before you start upgrading the OpenContrail nodes of your MCP cluster, complete the following prerequisite steps:

  1. Configure the server and client roles for Cassandra and ZooKeeper as described in OpenContrail 3.2: Create a backup schedule for a Cassandra database and OpenContrail 3.2: Create a backup schedule for a ZooKeeper database.

  2. If you are going to upgrade the compute nodes after upgrading the OpenContrail controller nodes, prepare the compute nodes for upgrade. Since the discovery service is removed in OpenContrail starting from v4.0, configure the endpoints statically:

    1. In cluster/<name>/opencontrail/compute.yml, add the system.opencontrail.compute.upgrade class under the system.opencontrail.compute.cluster class:

      classes:
      ...
      - system.opencontrail.compute.cluster
      - system.opencontrail.compute.upgrade
      ...
      
    2. Apply the opencontrail.compute state on the compute nodes.
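
      For example, one possible way to apply the state from the Salt Master node is the following command; the I@opencontrail:compute target is an assumption based on the compute targeting used elsewhere in this guide, adjust it to your environment:

      salt -C 'I@opencontrail:compute' state.apply opencontrail.compute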

    3. In cluster/<name>/opencontrail/compute.yml, remove the system.opencontrail.compute.upgrade class.

Now, proceed to Prepare the cluster model.

Prepare the cluster model

After you complete the prerequisite steps, prepare your cluster model for the upgrade by configuring your Git project repository as described below.

To prepare the cluster model:

  1. Log in to the Salt Master node.

  2. In cluster/<name>/opencontrail/init.yml, change the OpenContrail repository component and version to 4.1:

    _param:
      linux_repo_contrail_component: oc41
      opencontrail_version: 4.1
    
  3. In cluster/<name>/openstack/dashboard.yml, add or update the OpenContrail version inside the parameters section:

    _param:
      opencontrail_version: 4.1
    
  4. In cluster/<name>/openstack/init.yml, define the following parameters:

    _param:
      opencontrail_admin_password: <contrail_user_password>
      opencontrail_admin_user: 'contrail'
    
  5. In cluster/<name>/openstack/control/init.yml, add the following class if not present:

    classes:
    ...
    - system.keystone.client.service.contrail
    ...
    
  6. In cluster/<name>/opencontrail/analytics.yml, change the system.opencontrail.control.analytics class to system.opencontrail.control.analytics4_0.

  7. In cluster/<name>/opencontrail/analytics.yml, add or change the opencontrail_kafka_config_dir and opencontrail_kafka_log_dir parameters. For example:

    _param:
      opencontrail_kafka_config_dir: '/etc/kafka'
      opencontrail_kafka_log_dir: '/var/log/kafka'
    
  8. In cluster/<name>/opencontrail/control.yml, change the system.opencontrail.control.control class to system.opencontrail.control.control4_0.

  9. In cluster/<name>/opencontrail/analytics.yml and cluster/<name>/opencontrail/control.yml, modify the classes and the parameters sections:

    classes:
    ...
    - system.linux.system.repo.docker_legacy
    ...
    - service.docker.host
    - system.docker.client.compose
    ...
    parameters:
      ...
      docker:
        host:
          pkgs:
            - docker-engine
            - python-docker
    
  10. In cluster/<name>/opencontrail/control.yml and cluster/<name>/opencontrail/analytics.yml, remove the following classes:

    classes:
    ...
    - system.linux.system.repo.mcp.apt_mirantis.contrail
    ...
    - system.linux.system.repo.mcp.apt_mirantis.cassandra
    ...
    

    Note

    These classes are related to the package repositories of OpenContrail and Cassandra and must be removed since the new version of the OpenContrail control plane is delivered as Docker containers.

  11. In cluster/<name>/opencontrail/compute.yml, change the system.opencontrail.compute.cluster class to system.opencontrail.compute.cluster4_0.

  12. Add the upgrade pipeline to DriveTrain:

    1. Add the OpenContrail upgrade class to cluster/cicd/control/leader.yml:

      classes:
      - system.jenkins.client.job.deploy.update.upgrade_opencontrail4_0
      
    2. Apply the changes:

      salt -C 'I@jenkins:client' state.sls jenkins.client
      

Now, proceed to Upgrade the OpenContrail nodes.

Upgrade the OpenContrail nodes

After you prepare the cluster model of your MCP cluster, proceed to upgrading the OpenContrail controller nodes and the OpenContrail vRouter packages on the compute nodes.

Warning

During the upgrade process, the following resources are affected:

  • The instance(s) running on the compute nodes can be unavailable for up to 30 seconds.
  • The creation of new instances is not possible during the same time interval.

Therefore, you must plan a maintenance window as well as test the upgrade before applying it to production.

To upgrade the OpenContrail nodes:

  1. Log in to the Jenkins web UI.

  2. Open the Deploy - upgrade Opencontrail to 4.x pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    STAGE_CONTROLLERS_UPGRADE Select this check box to run upgrade on the OpenContrail controller nodes.
  4. Click Deploy.

  5. If the pipeline fails because a Docker container with the contrail-control service does not start or restarts infinitely:

    1. Log in to the container of the affected OpenContrail node.

    2. Verify whether the logs contain the UID already exists or GID already exists errors. For example:

      docker logs --tail 10 --follow --timestamps <container_ID>
      

      Example of system response extract:

      2018-09-13T06:52:15.569311246Z usermod: UID '109' already exists
      2018-09-13T06:53:15.714037288Z usermod: UID '109' already exists
      2018-09-13T06:54:15.863864322Z usermod: UID '109' already exists
      2018-09-13T06:55:16.026689446Z usermod: UID '109' already exists
      
    3. If the logs contain the above-mentioned errors:

      1. Log in to one of the affected OpenContrail nodes.

      2. Replace the contrail UID and GID with new ones that are not used by any existing user or group. For example, replace 109 with 901. Run the following commands:

        usermod -u <new uid> contrail
        groupmod -g <new gid> contrail
        
      3. Rerun the Deploy - upgrade Opencontrail to 4.x pipeline.

  6. If the pipeline fails with some OpenContrail services stuck in the initializing state:

    1. Restart ZooKeeper on all analyticsdb containers (see the example commands after this list).
    2. Once ZooKeeper is up and running, restart the services that are stuck.
    3. Rerun the Deploy - upgrade Opencontrail to 4.x pipeline.
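
    For the ZooKeeper restart in the first substep, one possible approach is to run the following commands for each analyticsdb container; the use of standard init scripts and the availability of contrail-status inside the container are assumptions, adjust the commands to your environment:

      # Restart ZooKeeper inside the container (assumes standard init scripts are available)
      docker exec <analyticsdb_container_ID> service zookeeper restart
      # Check the OpenContrail service states (assumes contrail-status is available in the container)
      docker exec <analyticsdb_container_ID> contrail-status
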
  7. If the pipeline fails with the Command 'docker-compose up -d' failed error message because docker-compose fails to initialize the opencontrail_analytics_1 and opencontrail_analyticsdb_1 containers:

    1. Run the following command from the Salt Master node:

      salt -C 'nal*' cmd.run "rm -rf /etc/cassandra/cassandra_analytics.yaml \
      && rm -rf /etc/cassandra/cassandra-env-analytics.sh \
      && rm -rf /etc/zookeeper/conf/zoo_analytics.cfg"
      
    2. Rerun the Deploy - upgrade Opencontrail to 4.x pipeline.

To upgrade the OpenContrail vRouter packages on the compute nodes:

  1. Log in to the Jenkins web UI.

  2. Open the Deploy - upgrade Opencontrail to 4.x pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    COMPUTE_TARGET_SERVERS Add I@opencontrail:compute or target the global name of your compute nodes, for example cmp001*.
    COMPUTE_TARGET_SUBSET_LIVE Add 1 to run the upgrade first on only one of the nodes defined in the COMPUTE_TARGET_SERVERS field. After this stage is done, the upgrade pipeline asks you to confirm the upgrade of all nodes defined in the COMPUTE_TARGET_SERVERS field.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    STAGE_COMPUTES_UPGRADE Select this check box to run upgrade on the compute nodes.
  4. Click Deploy. For details on how to monitor the deployment process, see MCP Deployment Guide: View the deployment details.

The Deploy - upgrade Opencontrail to 4.x pipeline upgrade stages are as follows:

  1. If STAGE_COMPUTES_UPGRADE is selected, the pipeline upgrades the OpenContrail packages on the compute nodes in two iterations:
    1. Upgrade the sample nodes defined by COMPUTE_TARGET_SUBSET_LIVE.
    2. After a manual confirmation, upgrade all compute nodes targeted by COMPUTE_TARGET_SERVERS.
  2. If STAGE_CONTROLLERS_UPGRADE is selected, the pipeline stops the OpenContrail services on the OpenContrail nal nodes and starts the analytics and analyticsdb containers. Once done, the same procedure applies to the OpenContrail ntw nodes.
Roll back the OpenContrail nodes

You can roll back the OpenContrail nodes and the OpenContrail packages on the compute nodes if the upgrade fails. You can find the full log of a specific upgrade build using Build History > Console Output > Full Log in the Jenkins UI.

To roll back the OpenContrail nodes:

  1. Log in to the Salt Master node.

  2. Revert the changes made in the Prepare the cluster model section.

  3. If you are rolling back only the compute nodes while having the OpenContrail controller nodes already upgraded to v4.x, add the system.opencontrail.compute.upgrade class to cluster/<name>/opencontrail/compute.yml.
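
    For example, the classes section of cluster/<name>/opencontrail/compute.yml would then mirror the snippet from the Prerequisites section:

    classes:
    ...
    - system.opencontrail.compute.cluster
    - system.opencontrail.compute.upgrade
    ...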

  4. Log in to the Jenkins web UI.

  5. Open the Deploy - upgrade Opencontrail to 4.x pipeline.

  6. Specify the following parameters as required:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    COMPUTE_TARGET_SERVERS Add the same string that you used during the upgrade that failed.
    COMPUTE_TARGET_SUBSET_LIVE Add the same string that you used during the upgrade that failed.
    STAGE_CONTROLLERS_ROLLBACK Select this check box to roll back the OpenContrail controller nodes.
    STAGE_COMPUTES_ROLLBACK Select this check box to roll back the compute nodes.
  7. Click Deploy. For details on how to monitor the deployment process, see: MCP Deployment Guide: View the deployment details.

The Deploy - upgrade Opencontrail to 4.x pipeline rollback stages are as follows:

  1. If STAGE_CONTROLLERS_ROLLBACK is selected, the pipeline stops the OpenContrail containers on the ntw and nal nodes and starts the OpenContrail services. The process requires manual confirmations that are based on the output of the nodetool status and contrail-status commands.
  2. If STAGE_COMPUTES_ROLLBACK is selected, the pipeline downgrades the OpenContrail packages on the compute nodes in two iterations:
    1. Downgrade the sample compute nodes defined by COMPUTE_TARGET_SUBSET_LIVE.
    2. After a manual confirmation, downgrade all compute nodes defined by COMPUTE_TARGET_SERVERS.

Upgrade an MCP OpenStack environment

This section provides the reference information to consider when creating a detailed maintenance plan for the upgrade of an OpenStack Ocata environment with OVS networking to the OpenStack Pike release as well as an OpenStack Pike environment with OVS networking to the OpenStack Queens release.

Caution

  • If you run an OpenContrail-based OpenStack environment, follow the procedures in this section to upgrade OpenStack only and skip the OVS-related steps. Once done, proceed with upgrading or updating OpenContrail.
  • To upgrade OpenContrail from version 3.2 to 4.1, use a separate procedure described in Upgrade the OpenContrail nodes from version 3.2 to 4.1.
  • To update OpenContrail from version 4.0 to 4.1, refer to Update the OpenContrail 4.x nodes.

Use the descriptive analysis of the techniques and tools, as well as the high-level upgrade flow included in this section, to create a cloud-specific detailed upgrade procedure, assess the risks, estimate possible downtimes, and plan the rollback, backup, and testing activities.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

Note

This section does not cover the OpenStack packages update procedure. To perform the OpenStack packages update, see Update OpenStack packages.

Caution

Manila deprecation notice

In the MCP 2019.2.7 update, the OpenStack Manila component is being considered for deprecation. The corresponding capabilities are still available, although not further enhanced.

Starting with the 2019.2.11 maintenance update, the OpenStack Manila component will no longer be supported by Mirantis. For those existing customers who have the Manila functionality explicitly included in the scope of their contracts, Mirantis will continue to fulfill the corresponding support obligations.

OpenStack upgrade levels

The depth of the upgrade can differ depending on a use case. The following table describes the possible upgrade levels.

Levels of upgrade
Level Description
Application upgrade Only application packages are upgraded. Includes upgrade of application dependencies to the desired level.
System packages upgrade Known as apt-get upgrade. Used to install the newest versions of all currently installed packages on the system from the sources enumerated in /etc/apt/sources.list. The packages are retrieved and upgraded. Under no circumstances are currently installed packages removed. The packages that are not yet installed but required are retrieved and installed. The new versions of the currently installed packages that cannot be upgraded without changing the installation status of another package or packages are left at their current version. An apt-get update must be performed first so that apt-get knows that new versions of the packages are available.
Kernel upgrade Known as apt-get dist-upgrade. In addition to performing the function of upgrade, handles the changing dependencies with new versions of packages. The apt-get tool has a smart conflict resolution system and attempts to upgrade the most important packages at the expense of less important ones if necessary. The dist-upgrade command can remove some packages. The /etc/apt/sources.list file contains the list of locations from which the desired package files should be retrieved. A reboot might be needed after this type of upgrade.
Release upgrade Known as do-release-upgrade. Upgrades a Linux distribution to a newer major release.
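
For reference, these upgrade levels correspond to the following standard commands run on a target node:

apt-get update          # refresh the package index first
apt-get upgrade         # system packages upgrade
apt-get dist-upgrade    # kernel upgrade; may remove packages and require a reboot
do-release-upgrade      # release upgrade of the Linux distribution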

Note

To minimize the control plane downtime, we recommend performing the application level upgrade first. When the OpenStack component is fully upgraded, proceed with the system upgrade. For details, see The OpenStack formulas structure.

OpenStack upgrade tools overview

This section provides the overview of the tools used during the OpenStack upgrade.

MCP DriveTrain provides the following OpenStack-related upgrade pipelines:

  • Deploy - upgrade control VMs
  • Deploy - upgrade computes
  • Deploy - upgrade OVS gateway

Each job consists of different stages that are described in detail in the related sections below.

The control plane upgrade pipeline

The Deploy - upgrade control VMs pipeline job is designed to upgrade the control plane component of OpenStack. The workflow of the job includes the following stages:

Deploy - upgrade control VMs pipeline job workflow
Stage Description
Pre-upgrade

Only non-destructive actions are applied during this phase. Basic API verification is performed.

The job is launched on all target servers before moving to the next stage. Online dbsync is called explicitly to verify whether the actual upgrade can be initiated.

Caution

Online database synchronization is required before the upgrade to a new release. Since dbsync can take a lot of time, we recommend performing the manual online synchronization before the upgrade.

Stop OpenStack services Stops all OpenStack Python services on the target servers. This does not affect the data plane services such as OVS or KVM.
Upgrade OpenStack Upgrades the OpenStack Python code on the target nodes sequentially. Once the stage is finalized on a specific target node, the basic API checks are performed. No workload downtime is expected.
Upgrade OS Launches only if OS_UPGRADE or OS_DIST_UPGRADE is checked. A reboot can be performed if required. When the node is back online, the basic service checks are performed.
The data plane upgrade pipelines

The Deploy - upgrade OVS gateway and Deploy - upgrade computes pipeline jobs are designed to upgrade the data plane nodes of OpenStack. The resources running on the gateway nodes are smoothly migrated to other gateway nodes to minimize workload downtime. The workflow of these jobs includes the following stages:

Deploy - upgrade OVS gateway and Deploy - upgrade computes pipeline jobs workflow
Stage Description
Pre-upgrade Ensures that Neutron agents and compute services on the target hosts are alive. Only non-destructive actions are applied during this stage.
Upgrade pre: migrate resources Performs the smooth resource migration to minimize the workload downtime. Neutron agents on the node are set to the admin_disabled state to ensure that resources are quickly migrated to other nodes.
Upgrade OpenStack Upgrades the OpenStack Python code. Once completed, the basic API verification is performed. Before proceeding to the next stage, wait for the agents to become alive.
Upgrade OS Launches only if OS_UPGRADE or OS_DIST_UPGRADE is checked. A reboot can be performed if required. When the node is back online, the basic service checks are performed.
Upgrade post: enable resources Verifies that agents and services on nodes are up and adds them back to scheduling.
Post upgrade Performs the post upgrade cleanup of old configurations and temporary files. Only non-destructive actions are applied during this phase.
The OpenStack formulas structure

This section describes the structure of the OpenStack formulas related to upgrades. It defines the upgrade API for the formulas, which consists of the states described in the table below. By using these states, you can build a flexible upgrade logic for a particular use case.

The OpenStack formulas structure
State Description
<app>.upgrade.service_running Verify that all services for a particular application are enabled for autostart and running.
<app>.upgrade.service_stopped Verify that all services for a particular application are disabled for autostart and dead.
<app>.upgrade.pkg_latest Verify that the packages used by a particular application are upgraded to the latest available version. This will not upgrade the data plane packages like QEMU and OVS since the minimal version required by the OpenStack services is usually quite old. The data plane packages should be upgraded separately using apt-get upgrade or apt-get dist-upgrade. Applying this state does not autostart the services.
<app>.upgrade.render_config Verify that the configuration is rendered for the target version.
<app>.upgrade.pre We assume this state is applied on all nodes in the cloud before running the upgrade. Only non-destructive actions will be applied during this phase. Performs the built-in service checks such as keystone-manage doctor and nova-status upgrade check.
<app>.upgrade.upgrade.pre Mostly applicable for the data plane nodes. During this phase, resources will be gracefully removed from the current node if possible. The services of the upgraded applications will be set to the admin disabled state to make sure that the node does not participate in resource scheduling. For example, on the gtw nodes, this will set all agents to the admin disabled state and move all routers to other agents.
<app>.upgrade.upgrade Upgrade applications on a particular target node: stop services, render configuration, install new packages, run the offline dbsync for the ctl nodes, and start services. The data plane is not affected; only the OpenStack Python services are.
<app>.upgrade.upgrade.post Add services back to scheduling.
<app>.upgrade.post This phase should be launched only when the cloud upgrade is completed. During this phase, some services may be restarted and minimal downtime may occur.
<app>.upgrade.verify Perform the basic health checks such as the API CRUD operations, verification of the failed network agents and compute services.

The upgrade pillar has the following structure:

<app>:
  upgrade:
    upgrade_enabled: true|false
    old_release: <release of openstack we upgrade from>
    new_release: <release of openstack we upgrade to>
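
For example, a hypothetical pillar that enables the Nova upgrade from Ocata to Pike, mirroring the structure above, would look as follows:

nova:
  upgrade:
    upgrade_enabled: true
    old_release: ocata
    new_release: pike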

Also, an application can use additional upgrade parameters to control the upgrade behaviour. For example, to disable the Neutron router migration, configure the pillar as follows:

neutron:
  upgrade:
    resource_migration:
      l3:
        enabled: false
OpenStack upgrade workflow

Normally, an OpenStack upgrade includes the following stages:

OpenStack upgrade stages
# Stage Description
1 Planning Includes the creation of the maintenance plan.
2 Pre-upgrade Includes procedures that do not affect workability of the current OpenStack version such as running a QA cycle, verifying infrastructure, configuring monitoring of workloads and services. Also, during this stage, the backups are created and additional services, servers, and systems are installed to facilitate the upgrade process according to the plan.
3 Upgrade During this stage, the actual upgrade takes place.
3.1 Control plane upgrade During this stage, the control plane is being upgraded. It should not have an impact on the data plane, but there might be compatibility issues between the data plane and control plane of different versions. To minimize the control plane downtime, we recommend performing the quickest upgrade depth, which is the application-level upgrade, first. Once the full OpenStack upgrade is completed, perform the dist-upgrade or even the release upgrade by removing one controller node from the cloud, upgrading it, and adding it back.
3.2 Data plane upgrade During this stage, the servers that host the end-user data applications, including the compute, storage, and gateway nodes, are being upgraded. Depending on the upgrade requirements, any upgrade depth can be applied.
4 Post upgrade Includes procedures that will not affect workability, post upgrade testing activities, and cleanup.

Warning

Before you perform the upgrade on a production environment, accomplish the procedure on a staging environment. If a staging environment does not exist, adapt the exact cluster model and launch it inside the cloud as a Heat stack, which will act as a staging environment.

Plan the OpenStack upgrade

As a result of the planning stage of the OpenStack upgrade, a detailed maintenance plan is created.

The maintenance plan must include the following parts:

  • A strict step-by-step upgrade procedure

  • A rollback plan

  • A maintenance window schedule for each upgrade phase

    Note

    The upgrade flow is carefully selected by engineers in accordance with the workload and requirements of a particular cloud.

After the maintenance plan is successfully tested on a staging environment, you can proceed with the actual upgrade in production.

Caution

Manila deprecation notice

In the MCP 2019.2.7 update, the OpenStack Manila component is being considered for deprecation. The corresponding capabilities are still available, although not further enhanced.

Starting with the 2019.2.11 maintenance update, the OpenStack Manila component will no longer be supported by Mirantis. For those existing customers who have the Manila functionality explicitly included in the scope of their contracts, Mirantis will continue to fulfill the corresponding support obligations.

Limitations

The following are the limitations of the OpenStack upgrade pipelines:

  • The pipelines upgrade only the OpenStack component. The upgrade of other VCP components such as RabbitMQ and MySQL is out of scope and should be done independently.

    Caution

    We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

  • The upgrade of StackLight LMA is out of scope. To obtain the latest version, upgrade StackLight LMA as described in Upgrade StackLight LMA to Build ID 2019.2.0.

Prerequisites

Before you proceed with the OpenStack upgrade, verify the following:

  • Upgrade of MCP is done to the latest build ID. Verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.
  • All OpenStack formula states (nova, neutron, and so on) can be applied without errors.
  • Online dbsyncs for the services are performed before the upgrade maintenance window since this task can take significant time depending on the cloud size.
  • No failed OpenStack services or nodes are present in the cloud.
  • The disk space utilization does not exceed 80% on each target node, which includes the ctl*, prx*, gtw*, and cmp* nodes.
  • There is enough disk space on the node that will store the backups for the MySQL databases and the existing cluster model.
  • For the MCP 2019.2.3 OpenStack environments, verify that the python-tornado package is installed from the latest OpenStack version or the OpenStack version to which you are going to upgrade. Restart the Salt minion after the package installation.
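
    For example, the following illustrative commands check the installed package version and its origin and restart the Salt minion on a node:

    apt-cache policy python-tornado
    service salt-minion restart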

Upgrade OpenStack from Ocata to Pike

This section includes the detailed procedure to upgrade OpenStack from Ocata to Pike.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

Perform the pre-upgrade activities

The pre-upgrade stage includes the activities that do not affect the workability of the currently running OpenStack version, as well as the creation of backups.

To prepare your OpenStack deployment for the upgrade:

  1. Perform the steps described in Configure load balancing for Horizon.

  2. On one of the controller nodes, perform the online database migrations for the following services:

    Note

    The database migrations can be time-consuming and create high load on CPU and RAM. We recommend that you perform the migrations in batches.

    • Nova:

      nova-manage db online_data_migrations
      
    • Cinder:

      cinder-manage db online_data_migrations
      
    • Ironic:

      ironic-dbsync online_data_migrations
      
  3. Prepare the target nodes for the upgrade:

    1. Log in to the Salt Master node.

    2. Verify that the OpenStack cloud configuration file is present on the controller nodes:

      salt -C 'I@keystone:client:os_client_config' state.apply keystone.client.os_client_config
      
    3. Get the list of all upgradable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    4. Sort the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    5. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters.

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master.

    6. For each target node, get the list of the installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    7. Verify that the outdated version of the nova-osapi_compute service is not running:

      salt -C 'I@galera:master' mysql.query nova 'select services.id, services.host, \
      services.binary, services.version from services where services.version < 15'
      

      If the system output contains the nova-osapi_compute service, delete it by running the following commands from any OpenStack controller node:

      source keystonercv3
      openstack compute service delete <nova-osapi_compute_service_id>
      
    8. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    9. Apply the following states to each target node for each installed application in strict order of priority:

      Warning

      During upgrade, the applications running on the target nodes use the KeystoneRC metadata. To guarantee that the KeystoneRC metadata is exported to mine, verify that you apply the keystone.upgrade.pre formula to the keystone:client:enabled node:

      salt -C 'I@keystone:client:enabled' state.sls keystone.upgrade.pre
      
      salt <node_name> state.apply <component_name>.upgrade.pre
      salt <node_name> state.apply <component_name>.upgrade.verify
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.pre
      salt cmp01 state.apply nova.upgrade.verify
      

    On the clouds of medium and large sizes, you may want to automate step 3 to prepare the target nodes for the upgrade. Use the following script as an example of possible automation.

     #!/bin/bash
    #List of formulas that implements upgrade API sorted by priority
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | \
    sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
          salt $node state.apply $component.upgrade.verify
        fi
      done
    done
    
  4. Add the testing workloads and monitoring to each compute host and verify the following:

    • The cloud services are monitored as expected.
    • There are free resources (disk, RAM, CPU) on the kvm, ctl, cmp, and other nodes.
  5. Back up the OpenStack databases as described in Back up and restore a MySQL database.

  6. Adjust the cluster model:

    1. Include the upgrade pipeline job to DriveTrain:

      1. Add the following lines to classes/cluster/<cluster_name>/cicd/control/leader.yml:

        Caution

        If your MCP OpenStack deployment includes the OpenContrail component, do not specify the system.jenkins.client.job.deploy.update.upgrade_ovs_gateway class.

        classes:
         - system.jenkins.client.job.deploy.update.upgrade
         - system.jenkins.client.job.deploy.update.upgrade_ovs_gateway
         - system.jenkins.client.job.deploy.update.upgrade_compute
        
      2. Apply the jenkins.client state on the Jenkins nodes:

        salt -C 'I@jenkins:client' state.sls jenkins.client
        
    2. Set the parameters in classes/cluster/<cluster_name>/infra/init.yml as follows:

      parameters:
        _param:
          openstack_version: pike
          openstack_old_version: ocata
          openstack_upgrade_enabled: true
      
    3. (Optional) The upgrade pillars of all supported OpenStack applications are already included in the Reclass system level. In case of a non-standard setup, check the list of the OpenStack applications on each node and add the upgrade pillars to the OpenStack applications that do not contain them. For example:

      <app>:
        upgrade:
          enabled: ${_param:openstack_upgrade_enabled}
          old_release: ${_param:openstack_old_version}
          new_release: ${_param:openstack_version}
      

      Note

      On the clouds of medium and large sizes, you may want to automate this step. To obtain the list of the OpenStack applications running on a node, use the following script.

      #!/bin/bash
      #List of formulas that implements upgrade API sorted by priority
      all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
      jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
      sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
      #List of nodes in cloud
      list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
      for node in $list_nodes; do
        #List of applications on the given node
        node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
        jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
        node_openstack_app=" "
        for component in $all_formulas ; do
          if [[ " ${node_applications[*]} " == *"$component"* ]]; then
            node_openstack_app="$node_openstack_app $component"
          fi
        done
        echo "$node : $node_openstack_app"
      done
      
    4. Refresh pillars:

      salt '*' saltutil.refresh_pillar
      
  7. Prepare the target nodes for the upgrade:

    1. Get the list of all upgradable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      
    2. Sort the components from the output by priority.

    3. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master

    4. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    5. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    6. Apply the following states to each target node for each installed application in strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.pre
      

    Note

    On the clouds of medium and large sizes, you may want to automate this step. Use the following script as an example of possible automation.

     #!/bin/bash
    #List of formulas that implements upgrade API
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
        fi
      done
    done
    
  8. Apply the linux.system.repo state on the target nodes.
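
    For example, the state can be applied from the Salt Master node as follows; the target expression is illustrative, adjust it to your node naming:

    salt -C 'ctl* or prx* or gtw* or cmp*' state.apply linux.system.repo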

  9. Proceed to Upgrade the OpenStack control plane.

Upgrade the OpenStack control plane

The OpenStack control plane upgrade stage includes upgrading of the OpenStack services APIs. We recommend that you select the quickest upgrade depth that does not include running OS_UPGRADE or OS_DIST_UPGRADE to minimize the API downtime. You can perform both OS_UPGRADE and OS_DIST_UPGRADE during the post-upgrade stage if required.

To upgrade the OpenStack VCP:

  1. Log in to the Jenkins web UI.

  2. Run the Deploy - upgrade control VMs pipeline on the OpenStack controller nodes in the interactive mode setting the parameters as follows:

    • TARGET_SERVERS='ctl*'
    • MODE=INTERACTIVE to get the detailed description of the pipeline flow through the stages
  3. Verify that the control plane is up and the OpenStack services from the data plane are reconnected and working correctly with the newly upgraded control plane.

  4. Run the Deploy - upgrade control VMs pipeline on the proxy nodes setting TARGET_SERVERS='prx*'.

  5. Verify that the public API is accessible and Horizon is working.

  6. Perform the upgrade of other control plane nodes where required depending on your deployment.

  7. Verify that the control plane is upgraded to the intended OpenStack release, APIs work correctly and are available, and the services enable the users to manage their resources.

    Note

    The new features of the intended OpenStack release are not available until the data plane nodes are upgraded.

  8. Proceed to Upgrade the OpenStack data plane.

Upgrade the OpenStack data plane

The OpenStack data plane includes the servers that host end-user data applications. More specifically, these hosts include the compute, storage, and gateway nodes. Depending on the upgrade requirements, you can apply any of the upgrade depths while upgrading the data plane.

To upgrade the data plane of your OpenStack deployment:

  1. To upgrade the gateway nodes, select one of the following options:

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.

    • Non-HA routers are present in the cloud:

      1. Migrate the non-HA routers from the target nodes using the Neutron service, and the following commands in particular (see the example after this list):

        • l3-agent-router-add
        • l3-agent-router-remove
        • router-list-on-l3-agent
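
        For example, a possible migration sequence using the neutron CLI, where the agent and router IDs are placeholders that can be obtained with neutron agent-list and neutron router-list:

        neutron router-list-on-l3-agent <source_l3_agent_ID>
        neutron l3-agent-router-remove <source_l3_agent_ID> <router_ID>
        neutron l3-agent-router-add <target_l3_agent_ID> <router_ID>
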
      2. Log in to the Jenkins web UI.

      3. Run the Deploy - upgrade OVS gateway pipeline for the gateway nodes which you have migrated the workloads from.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

        Note

        Since all resources have already been migrated from the nodes, we recommend performing the full upgrade including OS_UPGRADE and OS_DIST_UPGRADE.

      4. Migrate the non-HA routers back and rerun the Deploy - upgrade OVS gateway pipeline for the rest of the gateway nodes.

      5. Verify that the gateway components are reconnected to the control plane.

    • Non-HA routers are not present in the cloud:

      1. Log in to the Jenkins web UI.

      2. Run the Deploy - upgrade OVS gateway pipeline for all gateway nodes specifying TARGET_SERVERS='gtw*'.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

  2. Verify that the gateway components are reconnected to the control plane.

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.

  3. Upgrade the OpenStack compute nodes.

    1. Estimate and minimize the risks and address the limitations of live migration.

      The limitations of the live migration technology include:

      Warning

      Before proceeding with live migration in a production environment, assess these risks thoroughly.

      • The CPU of a source compute node must have a feature set that is a subset of a feature set of the target compute CPU. Therefore, the migration should be performed between the compute nodes with identical CPUs with, preferably, identical microcode versions.
      • During the live migration, the entire memory state of a VM must be copied to another server. In the first place, the memory pages that are being changed at a slower rate are copied. After, the system copies the most active memory pages. If the number of pages that are being written to all the time is big, the migration process will never finish. High-memory, high-load Windows virtual machines are known to have this particular issue.
      • During the live migration, a very short downtime (1-2 seconds max) occurs. The reason for the downtime is that when the memory is copied, the execution context (VCPU state) has to be copied as well, and the execution itself must be switched to a new virtual machine. In addition to a short downtime, this causes a short clock lag on the migrated virtual machine. Therefore, if the migrated machine is hosting a part of a clustered service or system, the downtime and resulting time lag may have an adverse impact on the whole system.
      • The QEMU version installed on the source and target hosts should be the same and later than 2.5.
    2. Perform the live migration of workloads.
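
      For example, a workload can be moved off a compute node using the nova client, where the instance ID and target host are placeholders:

      nova live-migration <instance_ID> <target_host>
      nova show <instance_ID>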

    3. Log in to the Jenkins web UI.

    4. Run the Deploy - upgrade computes pipeline to upgrade the OpenStack compute nodes which you have migrated the workloads from. It is essential that you upgrade the compute nodes in small batches.

      Caution

      The impact of the upgrade process should be calculated for each compute node during the planning stage as this step may take a significant amount of time.

      Note

      Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

    5. Migrate the workloads back and rerun the Deploy - upgrade computes pipeline for the rest of the compute nodes.

  4. Verify that the compute nodes are reconnected to the control plane.

  5. Proceed to Perform the post-upgrade activities.

Perform the post-upgrade activities

The post-upgrade activities include the post-upgrade testing cycle and cleanup.

To finalize the upgrade:

  1. Perform the full verification cycle of your MCP OpenStack deployment.

  2. Verify that the following variables are set in the classes/cluster/<cluster_name>/infra/init.yml file:

    parameters:
      _param:
        openstack_upgrade_enabled: false
        openstack_version: pike
        openstack_old_version: ocata
    
  3. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  4. Remove the testing workloads and monitoring.

  5. Remove the upgrade leftovers that were created by applying the <app>.upgrade.post state:

    1. Log in to the Salt Master node.

    2. Get the list of all upgradable OpenStack components. For example:

      salt cfg01* config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    3. Sort the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    4. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      Note

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master

    5. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    6. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    7. Apply the following states to each target node for each installed application in strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.post
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.post
      

    Note

    On the clouds of medium and large sizes, you may want to automate this step. Use the following script as an example of possible automation. Before running the script, verify that you define the $cluster_domain and $salt_master_hostname variables.

     #!/bin/bash
    #List of formulas that implements upgrade API sorted by priority
    all_formulas=$(salt cfg01* config.get orchestration:upgrade:applications   --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt-key | grep $cluster_domain | grep -v $salt_master_hostname | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications   --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.post
        fi
      done
    done
    
  6. Set the following variables in classes/cluster/<cluster_name>/infra/init.yml:

    parameters:
      _param:
        openstack_upgrade_enabled: false
        openstack_version: pike
        openstack_old_version: pike
        ...
    
  7. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
Upgrade OpenStack from Pike to Queens

This section includes the detailed procedure to upgrade OpenStack from Pike to Queens.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

Perform the pre-upgrade activities

The pre-upgrade stage includes the activities that do not affect the workability of the currently running OpenStack version, as well as the creation of backups.

To prepare your OpenStack deployment for the upgrade:

  1. Perform the steps described in Configure load balancing for Horizon.

  2. On one of the controller nodes, perform the online database migrations for the following services:

    Note

    The database migrations can be time-consuming and create high load on CPU and RAM. We recommend that you perform the migrations in batches.

    • Nova:

      nova-manage db online_data_migrations
      
    • Cinder:

      cinder-manage db online_data_migrations
      
    • Ironic:

      ironic-dbsync online_data_migrations
      
  3. Prepare the target nodes for the upgrade.

    Note

    On the clouds of medium and large sizes, you may want to automate the steps below to prepare the target nodes for the upgrade. Use the following script as an example of possible automation.

    #!/bin/bash
    #List of formulas that implements upgrade API sorted by priority
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
          salt $node state.apply $component.upgrade.verify
        fi
      done
    done
    
    1. Log in to the Salt Master node.

    2. Verify that the OpenStack cloud configuration file is present on the controller nodes:

      salt -C 'I@keystone:client:os_client_config' state.apply keystone.client.os_client_config
      
    3. Get the list of all upgradable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    4. Sort the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    5. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master

    6. For each target node, get the list of the installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    7. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    8. If the public endpoint for the Nova placement API was not created before:

      1. Add the following class to the Reclass model in the classes/cluster/<cluster_name>/openstack/proxy.yml file:

        classes:
        - system.nginx.server.proxy.openstack.placement
        
      2. Refresh pillars on the proxy nodes:

        salt 'prx*' saltutil.refresh_pillar
        
      3. Apply the nginx state on the proxy nodes:

        salt 'prx*' state.sls nginx
        
    9. Apply the following states to each target node for each installed application in strict order of priority:

      Warning

      During the upgrade, the applications running on the target nodes use the KeystoneRC metadata. To guarantee that the KeystoneRC metadata is exported to the Salt Mine, verify that you apply the keystone.upgrade.pre state to the keystone:client:enabled node:

      salt -C 'I@keystone:client:enabled' state.sls keystone.upgrade.pre
      
      salt <node_name> state.apply <component_name>.upgrade.pre
      salt <node_name> state.apply <component_name>.upgrade.verify
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.pre
      salt cmp01 state.apply nova.upgrade.verify
      
  4. Add the testing workloads and monitoring to each compute host and verify the following (see the example check after this list):

    • The cloud services are monitored as expected.
    • There are free resources (disk, RAM, CPU) on the kvm, ctl, cmp, and other nodes.
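
    For example, a minimal sketch of such a resource check from the Salt Master node, assuming the standard kvm, ctl, and cmp host name prefixes:

    # Show free memory, root file system space, and CPU count on the kvm, ctl, and cmp nodes
    salt -C 'kvm* or ctl* or cmp*' cmd.run 'free -h; df -h /; nproc'
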
  5. Back up the OpenStack databases as described in Back up and restore a MySQL database.

  6. If Octavia is enabled, move the Octavia certificates from the gtw01 node to the Salt Master node.

  7. Adjust the cluster model for the upgrade:

    1. Include the upgrade pipeline job to DriveTrain:

      1. Add the following lines to classes/cluster/<cluster_name>/cicd/control/leader.yml:

        Caution

        If your MCP OpenStack deployment includes the OpenContrail component, do not specify the system.jenkins.client.job.deploy.update.upgrade_ovs_gateway class.

        classes:
         - system.jenkins.client.job.deploy.update.upgrade
         - system.jenkins.client.job.deploy.update.upgrade_ovs_gateway
         - system.jenkins.client.job.deploy.update.upgrade_compute
        
      2. Apply the jenkins.client state on the Jenkins nodes:

        salt -C 'I@jenkins:client' state.sls jenkins.client
        
    2. Set the parameters in classes/cluster/<cluster_name>/infra/init.yml as follows:

      parameters:
        _param:
          openstack_version: queens
          openstack_old_version: pike
          openstack_upgrade_enabled: true
      
    3. (Optional) To upgrade Gnocchi to the version supported in Queens, modify the classes/cluster/<cluster_name>/openstack/init.yml file:

      1. Define the following parameters:

        parameters:
          _param:
            gnocchi_version: 4.2
            gnocchi_old_version: 4.0
        
      2. Update the Redis server version from 3.0 to 5.0:

        parameters:
          redis:
            server:
              version: 5.0
        
    4. (Optional) The upgrade pillars of all supported OpenStack applications are already included in the system level of Reclass. In case of a non-standard setup, check the list of OpenStack applications on each node and add the upgrade pillars to the OpenStack applications that do not contain them. For example:

      <app>:
        upgrade:
          enabled: ${_param:openstack_upgrade_enabled}
          old_release: ${_param:openstack_old_version}
          new_release: ${_param:openstack_version}
      

      Note

      To obtain the list of the OpenStack applications running on a node, use the following script.

      #!/bin/bash
      #List of formulas that implement the upgrade API, sorted by priority
      all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
      jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
      sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
      #List of nodes in cloud
      list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
      for node in $list_nodes; do
        #List of applications on the given node
        node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
        jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
        node_openstack_app=" "
        for component in $all_formulas ; do
          if [[ " ${node_applications[*]} " == *"$component"* ]]; then
            node_openstack_app="$node_openstack_app $component"
          fi
        done
        echo "$node : $node_openstack_app"
      done
      
    5. Enable the Keystone v3 client configuration for the Keystone resources creation by editing classes/cluster/<cluster_name>/openstack/control/init.yml:

      classes:
      - system.keystone.client.v3
      
    6. Refresh pillars:

      salt '*' saltutil.refresh_pillar
      
    7. Apply the salt.minion state:

      salt '*' state.apply salt.minion
      
  8. Prepare the target nodes for the upgrade:

    1. Get the list of all upgradable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      
    2. Order the components from the output by priority.

    3. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters.

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master.

    4. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    5. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    6. Apply the following states to each target node for each installed application in strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.pre
      

    Note

    On medium and large clouds, you may want to automate this step. Use the following script as an example of possible automation.

    #!/bin/bash
    #List of formulas that implement the upgrade API
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
        fi
      done
    done
    
  9. Apply the linux.system.repo state on the target nodes.

  10. Proceed to Upgrade the OpenStack control plane.

Upgrade the OpenStack extra components

The extra OpenStack components include:

  • Manila
  • Barbican
  • Tenant Telemetry (Aodh, Ceilometer, Panko, and Gnocchi)

MCP supports the upgrade of the extra OpenStack components starting from the Pike OpenStack release deployed with the 2018.11.0 MCP version.

By default, these components are running on dedicated VCP nodes and can be upgraded during separate maintenance windows using the Deploy - upgrade control VMs job as described in Upgrade the OpenStack control plane.

Upgrade the OpenStack control plane

The OpenStack control plane upgrade stage includes upgrading of the OpenStack services APIs. To minimize the API downtime, we recommend that you select the quickest upgrade depth that does not include running OS_UPGRADE or OS_DIST_UPGRADE. You can perform both OS_UPGRADE and OS_DIST_UPGRADE during the post-upgrade stage if required.

To upgrade the OpenStack VCP:

  1. Log in to the Jenkins web UI.

  2. If Ironic will be upgraded, perform the following steps to upgrade Ironic Conductor before the Ironic API. The nova-compute service running on the Ironic Conductor nodes must be upgraded only after the Nova Controller has been upgraded.

    Caution

    The upgrade of Ironic is available starting from the MCP 2019.2.6 update. See MCP Release Notes: Maintenance updates for details.

    1. Verify that the following variables are set on each Ironic Conductor node in Reclass in the classes/nodes/_generated/<node_name>.yml files:

      parameters:
        _param:
          nova:
            upgrade:
              enabled: false
      
    2. Refresh pillars:

      salt '*' saltutil.refresh_pillar
      
    3. Run the Deploy - upgrade control VMs pipeline job on the Ironic Conductor nodes in the interactive mode setting the TARGET_SERVERS parameter to bmt*.

    4. Once the pipeline job execution is finished, verify that the following variables are set for each Ironic Conductor node in Reclass in the classes/nodes/_generated/<node_name>.yml files:

      parameters:
        _param:
          nova:
            upgrade:
              enabled: true
          ironic:
            upgrade:
              enabled: false
      
    5. Refresh pillars:

      salt '*' saltutil.refresh_pillar
      
  3. Run the Deploy - upgrade control VMs pipeline job on the OpenStack controller nodes in the interactive mode setting the parameters as follows:

    • TARGET_SERVERS=ctl*

      After you upgrade the ctl nodes, define the following values one by one to upgrade additional OpenStack components from Pike to Queens as required:

      • TARGET_SERVERS=share* to upgrade the Manila control plane

      • TARGET_SERVERS=mdb* to upgrade the Tenant Telemetry including Ceilometer, Gnocchi, Aodh, and Panko

      • TARGET_SERVERS=kmn* to upgrade Barbican

      • TARGET_SERVERS=bmt* to upgrade Ironic

        Note

        During the second execution of the pipeline job on the Ironic Conductor nodes, only nova-compute will be upgraded since Ironic has already been upgraded in step 2.

    • MODE=INTERACTIVE to get the detailed description of the pipeline job flow through the stages

  4. Verify that the control plane is up and the OpenStack services from the data plane are reconnected and working correctly with the newly upgraded control plane.

  5. Run the Deploy - upgrade control VMs pipeline job on the proxy nodes with TARGET_SERVERS='prx*' set.

  6. Verify that the public API is accessible and Horizon is working.
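
    For example, a minimal availability check, assuming a hypothetical public VIP name of cloud.example.local with Keystone exposed on port 5000 and Horizon served over HTTPS:

    # Keystone should return its version document; Horizon should return an HTTP 200 or 302 response
    curl -k https://cloud.example.local:5000/v3
    curl -k -I https://cloud.example.local/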

  7. Perform the upgrade of other control plane nodes where required depending on your deployment.

  8. Verify that the control plane is upgraded to the intended OpenStack release, APIs work correctly and are available, and the services enable the users to manage their resources.

    Note

    The new features of the intended OpenStack release are not available till the data plane nodes are upgraded.
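
    As a minimal sketch of the verification in this step, assuming that the admin credentials file /root/keystonercv3 is present on a controller node (the path is an assumption and may differ in your deployment), you can list the service endpoints and agent states:

    # Run on one of the ctl nodes
    source /root/keystonercv3
    openstack endpoint list
    openstack compute service list
    openstack network agent list
    openstack volume service list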

  9. Proceed to Upgrade the OpenStack data plane.

Upgrade the OpenStack data plane

The OpenStack data plane includes the servers that host end-user data applications. More specifically, these hosts include compute, storage, and gateway nodes. Depending on the upgrade requirements, you can apply any of the upgrade depths while upgrading the data plane.

To upgrade the data plane of your OpenStack deployment:

  1. To upgrade the gateway nodes, select one of the following options:

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.

    • Non-HA routers are present in the cloud:

      1. Migrate the non-HA routers from the target nodes using the Neutron service, in particular the following commands (see the example sequence after this list):

        • l3-agent-router-add
        • l3-agent-router-remove
        • router-list-on-l3-agent
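
        A minimal sketch of the migration sequence for one router, using the hypothetical <src_agent_id>, <dst_agent_id>, and <router_id> placeholders that you can obtain from neutron agent-list and neutron router-list:

        # List the routers hosted by the source L3 agent
        neutron router-list-on-l3-agent <src_agent_id>
        # Move a router from the source agent to the destination agent
        neutron l3-agent-router-remove <src_agent_id> <router_id>
        neutron l3-agent-router-add <dst_agent_id> <router_id>
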
      2. Log in to the Jenkins web UI.

      3. Run the Deploy - upgrade OVS gateway pipeline for the gateway nodes from which you have migrated the workloads.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

        Note

        Since all resources have already been migrated from the nodes, we recommend performing the full upgrade including OS_UPGRADE and OS_DIST_UPGRADE.

      4. Migrate the non-HA routers back and rerun the Deploy - upgrade OVS gateway pipeline for the rest of the gateway nodes.

      5. Verify that the gateway components are reconnected to the control plane.

    • Non-HA routers are not present in the cloud:

      1. Log in to the Jenkins web UI.

      2. Run the Deploy - upgrade OVS gateway pipeline for all gateway nodes specifying TARGET_SERVERS='gtw*'.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

  2. Verify that the gateway components are reconnected to the control plane.

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.

  3. Upgrade the OpenStack compute nodes.

    1. Estimate and minimize the risks and address the limitations of live migration.

      The limitations of the live migration technology include:

      Warning

      Before proceeding with live migration in a production environment, assess these risks thoroughly.

      • The CPU feature set of the source compute node must be a subset of the feature set of the target compute node CPU. Therefore, perform the migration between compute nodes with identical CPUs and, preferably, identical microcode versions.
      • During the live migration, the entire memory state of a VM must be copied to another server. The memory pages that change at a slower rate are copied first, and then the system copies the most active memory pages. If a large number of pages is being written to continuously, the migration process may never finish. High-memory, high-load Windows virtual machines are known to have this particular issue.
      • During the live migration, a very short downtime (1-2 seconds at most) occurs because, once the memory is copied, the execution context (VCPU state) has to be copied as well and the execution must be switched to the new virtual machine. In addition to the short downtime, this causes a short clock lag on the migrated virtual machine. Therefore, if the migrated machine hosts a part of a clustered service or system, the downtime and the resulting time lag may adversely impact the whole system.
      • The QEMU version installed on the source and target hosts must be the same and not earlier than 2.5. See the example check below.
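
      For example, a minimal version and CPU model check across the compute nodes from the Salt Master node (assuming the qemu-system-x86_64 binary is installed on the compute nodes):

      # Compare the QEMU versions and CPU models across the compute nodes
      salt 'cmp*' cmd.run 'qemu-system-x86_64 --version'
      salt 'cmp*' cmd.run "grep -m1 'model name' /proc/cpuinfo"
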
    2. Perform the live migration of workloads.
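
      For example, a hedged sketch using the OpenStack and nova CLIs, assuming the hypothetical <instance_id> placeholder and cmp02 as the target host, run from a node with the admin credentials sourced:

      # List the instances hosted on the compute node to be upgraded
      openstack server list --all-projects --host cmp01
      # Live-migrate one instance to another compute node
      nova live-migration <instance_id> cmp02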

    3. Log in to the Jenkins web UI.

    4. Run the Deploy - upgrade computes pipeline to upgrade the OpenStack compute nodes from which you have migrated the workloads. It is essential that you upgrade the compute nodes in small batches.

      Caution

      The impact of the upgrade process should be calculated for each compute node during the planning stage as this step may take a significant amount of time.

      Note

      Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

    5. Migrate the workloads back and rerun the Deploy - upgrade computes pipeline for the rest of the compute nodes.

  4. Verify that the compute nodes are reconnected to the control plane.

  5. Proceed to Perform the post-upgrade activities.

Perform the post-upgrade activities

The post-upgrade activities include the post-upgrade testing cycle and cleanup.

To finalize the upgrade:

  1. Perform the full verification cycle of your MCP OpenStack deployment.

  2. Verify that the following variables are set in the classes/cluster/<cluster_name>/infra/init.yml file:

    parameters:
      _param:
        openstack_upgrade_enabled: false
        openstack_version: queens
        openstack_old_version: pike
    
  3. (Optional) If Gnocchi was upgraded, define the following parameters in classes/cluster/<cluster_name>/openstack/init.yml:

    parameters:
      _param:
        gnocchi_version: 4.2
        gnocchi_old_version: 4.0
    
  4. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  5. Remove the test workloads/monitoring.

  6. Remove the upgrade leftovers by applying the <app>.upgrade.post state to each target node:

    1. Log in to the Salt Master node.

    2. Get the list of all upgradable OpenStack components. For example:

      salt cfg01* config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    3. Order the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    4. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      Note

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters.

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master.

    5. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    6. Match the lists of upgradable OpenStack components with the lists of installed applications for each target node.

    7. Apply the following states to each target node for each installed application in strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.post
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.post
      

    Note

    On medium and large clouds, you may want to automate this step. Use the following script as an example of possible automation. Before running the script, verify that you have defined the $cluster_domain and $salt_master_hostname variables.

    #!/bin/bash
    #List of formulas that implement the upgrade API, sorted by priority
    all_formulas=$(salt cfg01* config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt-key | grep $cluster_domain | grep -v $salt_master_hostname | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.post
        fi
      done
    done
    
  7. Set the following variables in classes/cluster/<cluster_name>/infra/init.yml:

    parameters:
      _param:
        openstack_upgrade_enabled: false
        openstack_version: queens
        openstack_old_version: queens
    

    If Gnocchi was upgraded, set the following parameters in the same file:

    parameters:
      _param:
        gnocchi_version: 4.2
        gnocchi_old_version: 4.2
    
  8. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    

Upgrade Galera

This section includes instructions on how to upgrade the Galera cluster. You can upgrade Galera either automatically using the corresponding Jenkins pipeline or manually.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Upgrade Galera automatically

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to upgrade the Galera cluster automatically through Jenkins using the Deploy - upgrade Galera cluster pipeline job.

To upgrade Galera:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. Include the Galera upgrade pipeline job to DriveTrain:

    1. In the classes/cluster/<cluster_name>/cicd/control/leader.yml file, add the following class:

      classes:
      - system.jenkins.client.job.deploy.update.upgrade_galera
      
    2. Apply the jenkins.client state on the Jenkins nodes:

      salt -C 'I@jenkins:client' state.sls jenkins.client
      
    3. In the classes/cluster/<cluster_name>/infra/init.yml file, set the openstack_upgrade_enabled parameter to true:

      parameters:
        _param:
          openstack_upgrade_enabled: true
      
    4. Refresh pillars on the dbs* nodes:

      salt 'dbs*' saltutil.refresh_pillar
      
  4. Add repositories with new Galera packages:

    1. Apply the linux.system.repo state on the dbs* nodes:

      salt 'dbs*' state.sls linux.system.repo
      
    2. Update the /etc/salt/minion.d/_orchestration.conf file:

      salt 'cfg*' state.sls salt.master
      

      Warning

      Verify that you have installed the latest Galera Salt formula and it is present in /etc/salt/minion.d/_orchestration.conf. The file should include the galera line.
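
      For example, a minimal check of both conditions, run on the Salt Master node:

      # Verify the installed Galera Salt formula version
      dpkg -l | grep salt-formula-galera
      # Verify that the orchestration configuration includes the galera line
      grep galera /etc/salt/minion.d/_orchestration.conf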

  5. Log in to the Jenkins web UI.

  6. Open the Deploy - upgrade Galera cluster pipeline.

  7. Specify the following parameters as required:

    Deploy - upgrade Galera cluster pipeline parameters
    Parameter Description
    INTERACTIVE Specifies the mode to get the detailed description of the pipeline job flow through the stages.
    SHUTDOWN_CLUSTER Shuts down all MySQL instances on the target nodes during upgrade.
    OS_DIST_UPGRADE Upgrades system packages including kernel using apt-get dist-upgrade. Optional.
    OS_UPGRADE Upgrades all installed applications using apt-get upgrade. Optional.
    SALT_MASTER_CREDENTIALS Defines the Salt Master node credentials to use for connection, defaults to salt.
    SALT_MASTER_URL Defines the Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    TARGET_SERVERS Adds the target database server nodes. Defaults to dbs*.
  8. Click Deploy.

    To monitor the deployment process, follow the instructions in MCP Deployment Guide: View the deployment details.

    The Deploy - upgrade Galera cluster pipeline workflow
    Stage Description
    Pre-upgrade Only non-destructive actions are applied during this phase. Basic service verification is performed. The job is launched on all target servers before moving to the next stage.
    Stop the Galera cluster

    All Galera instances are stopped on the TARGET_SERVERS nodes in reverse order. For example, dbs03, dbs02, and then dbs01.

    OpenStack APIs are not accessible starting from this point. This does not affect the data plane services such as OVS or KVM nodes.

    Upgrade Galera The Galera code is upgraded. No workload downtime is expected.
    Upgrade OS Optional. Launches only if OS_UPGRADE or OS_DIST_UPGRADE is selected. A reboot can be performed if required. When the node is back online, the basic service checks are performed.
    Start the Galera cluster The Galera cluster is started on the TARGET_SERVERS nodes beginning with the last instance stopped. For example, the nodes are started in the following order: dbs01, dbs02, and dbs03.
  9. Revert the changes in the classes/cluster/<cluster_name>/infra/init.yml file made during step 3.3.

Upgrade Galera manually

This section instructs you on how to manually upgrade the Galera cluster. Only the MySQL and Galera packages will be upgraded. The upgrade of an underlying operating system is out of scope.

During the upgrade, the Galera cluster remains alive while you shut down the MySQL service on each node one by one, upgrade its packages, and then restart the service. When the node reconnects, it synchronizes with the cluster as in the case of any other outage.

Warning

Before performing the upgrade on a production environment:

  • Accomplish the procedure on a staging environment to determine the required maintenance window duration.
  • Schedule an appropriate maintenance window to reduce the load on the cluster.
  • Do not shut down the VMs or workloads as networking and storage functions are not affected.

To upgrade the Galera cluster:

  1. Prepare the Galera cluster for the upgrade:

    1. Verify that you have added the required repositories on the Galera nodes to download the updated MySQL and Galera packages.
    2. Verify that your Galera cluster is up and running as described in Verify a Galera cluster status.
    3. Create an instant backup of the MySQL database as described in Back up and restore a MySQL database.
  2. Log in to the Salt Master node.

  3. Obtain the new packages on all Galera nodes:

    salt -C 'I@galera:*' cmd.run 'apt-get clean; apt-get update'
    
  4. Verify that the MySQL and Galera packages are available on all Galera nodes:

    salt -C 'I@galera:*' cmd.run "apt-cache policy mysql-wsrep-5.6 |egrep -i 'installed|candidate'"
    salt -C 'I@galera:*' cmd.run "apt-cache policy galera-3 |egrep -i 'installed|candidate'"
    

    Example of system response:

    dbs02.openstack-ovs-core-ssl-pike-8602.local:
          Installed: 5.6.35-0.1~u16.04+mcp2
          Candidate: 5.6.41-1~u16.04+mcp1
    dbs01.openstack-ovs-core-ssl-pike-8602.local:
          Installed: 5.6.35-0.1~u16.04+mcp2
          Candidate: 5.6.41-1~u16.04+mcp1
    dbs03.openstack-ovs-core-ssl-pike-8602.local:
          Installed: 5.6.35-0.1~u16.04+mcp2
          Candidate: 5.6.41-1~u16.04+mcp1
    
  5. Verify the runtime versions of the MySQL nodes of the Galera cluster:

    salt -C 'I@galera:*' mysql.version
    

    Example of system response:

    dbs02.openstack-ovs-core-ssl-pike-8602.local:
        5.6.35-0.1~u16.04+mcp2
    dbs01.openstack-ovs-core-ssl-pike-8602.local:
        5.6.35-0.1~u16.04+mcp2
    dbs03.openstack-ovs-core-ssl-pike-8602.local:
        5.6.35-0.1~u16.04+mcp2
    
  6. Perform the following steps on one of the Galera slave nodes, for example, on the third instance:

    1. Stop the MySQL service:

      salt -C 'I@galera:* and *03*' service.stop mysql
      
    2. Upgrade the packages:

      salt -C 'I@galera:* and *03*' cmd.run "apt-get -y install --reinstall -o \
      DPkg::Options::=--force-confold -o Dpkg::Options::=--force-confdef mysql-wsrep-5.6 \
      mysql-wsrep-common-5.6 mysql-wsrep-libmysqlclient18 galera-3"
      
    3. Apply the galera Salt state:

      salt -C 'I@galera:* and *03*' state.apply galera
      
    4. Verify that your Galera cluster is up and running as described in Verify a Galera cluster status.

  7. Perform step 6 on the remaining Galera nodes one by one (see the example for the second node below).
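
    For example, for the second Galera node, the same sequence from step 6 looks as follows:

    salt -C 'I@galera:* and *02*' service.stop mysql
    salt -C 'I@galera:* and *02*' cmd.run "apt-get -y install --reinstall -o \
    DPkg::Options::=--force-confold -o Dpkg::Options::=--force-confdef mysql-wsrep-5.6 \
    mysql-wsrep-common-5.6 mysql-wsrep-libmysqlclient18 galera-3"
    salt -C 'I@galera:* and *02*' state.apply galera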

  8. Verify the cluster status after upgrade:

    1. Verify the versions of the installed packages:

      salt -C 'I@galera:*' mysql.version
      
    2. Verify that your Galera cluster is up and running as described in Verify a Galera cluster status.


Starting from the MCP 2019.2.18 maintenance update, you can upgrade your Galera cluster to version 5.7:

Upgrade Galera to v5.7 automatically

Note

This feature is available starting from the MCP 2019.2.18 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to automatically upgrade the Galera cluster to version 5.7 using the Deploy - upgrade Galera cluster Jenkins pipeline job.

To upgrade Galera to v5.7 automatically:

  1. Verify that you have performed a backup of your database as described in Back up and restore a MySQL database.

  2. Perform a MySQL dump to a file that can later be used as a data source for manual restoration. Verify that the node has enough free space for the dump file (see the example check after the command below). Run the following command on a database server node (dbs):

    mysqldump --defaults-file=/etc/mysql/debian.cnf -AR > %dump_file
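
    To estimate whether the node has enough free space for the dump, you can, for example, compare the database size with the available space in the target directory:

    # Approximate size of the MySQL data directory
    du -sh /var/lib/mysql
    # Free space on the file system that will hold the dump file
    df -h /var/lib/mysql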
    
  3. Log in to the Salt Master node.

  4. Open the cluster level of your deployment model.

  5. Specify the new version for Galera packages:

    1. In <cluster_name>/openstack/database/mysql_version.yml, create a new YAML file with the following content:

      parameters:
        _param:
          galera_mysql_version: "5.7"
      
    2. In <cluster_name>/openstack/database/master.yml and <cluster_name>/openstack/database/slave.yml, add the created file:

      classes:
      ...
      - cluster.<cluster_name>.openstack.database.mysql_version
      
    3. Refresh pillars on the database nodes:

      salt -C "I@galera:master or I@galera:slave" saltutil.refresh_pillar
      
    4. Verify that the pillars of the database nodes have Galera version 5.7:

      salt -C "I@galera:master or I@galera:slave" pillar.get galera:version:mysql
      

      Warning

      If the Galera version is not 5.7, resolve the issue before proceeding with the upgrade.

  6. Add repositories with new Galera packages:

    1. Apply the linux.system.repo state on the database nodes:

      salt -C "I@galera:master or I@galera:slave" state.sls linux.system.repo
      
    2. Verify the availability of the new MySQL packages:

      salt -C "I@galera:master or I@galera:slave" cmd.run 'apt-cache policy mysql-wsrep-server-5.7 mysql-wsrep-5.7'
      
    3. Verify the availability of the new percona-xtrabackup-24 packages:

      salt -C "I@galera:master or I@galera:slave" cmd.run 'apt-cache policy percona-xtrabackup-24'
      
    4. Verify that the salt-formula-galera version is 1.0+202202111257.6945afc or later:

      dpkg -l |grep salt-formula-galera
      
  7. Log in to the Jenkins web UI.

  8. Open the Deploy - upgrade Galera cluster pipeline.

  9. Specify the following parameters as required:

    Deploy - upgrade Galera cluster pipeline parameters
    Parameter Description
    INTERACTIVE Specifies the mode to get the detailed description of the pipeline job flow through the stages.
    UPDATE_TO_MYSQL57 Set this flag if you are upgrading MySQL from version 5.6 to 5.7.
    SHUTDOWN_CLUSTER Shuts down all MySQL instances on the target nodes during upgrade.
    OS_DIST_UPGRADE Optional. Upgrades system packages including kernel using apt-get dist-upgrade.
    OS_UPGRADE Optional. Upgrades all installed applications using apt-get upgrade.
    SALT_MASTER_CREDENTIALS Defines the Salt Master node credentials to use for connection, defaults to salt.
    SALT_MASTER_URL Defines the Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    TARGET_SERVERS Adds the target database server nodes. Defaults to dbs*.

    Warning

    Verify that you have selected the UPDATE_TO_MYSQL57 and SHUTDOWN_CLUSTER check boxes.

  10. Click Deploy.

    To monitor the deployment process, follow the instructions in MCP Deployment Guide: View the deployment details.

    The Deploy - upgrade Galera cluster pipeline workflow
    Stage Description
    Pre-upgrade Only non-destructive actions are applied during this phase. Basic service verification is performed. The job is launched on all target servers before moving to the next stage.
    Stop the Galera cluster

    All Galera instances are stopped on the TARGET_SERVERS nodes in reverse order. For example, dbs03, dbs02, and then dbs01.

    OpenStack APIs are not accessible starting from this point. This does not affect the data plane services such as OVS or KVM nodes.

    Upgrade Galera The Galera code is upgraded. No workload downtime is expected.
    Upgrade OS Optional. Launches only if OS_UPGRADE or OS_DIST_UPGRADE is selected. A reboot can be performed if required. When the node is back online, the basic service checks are performed.
    Start the Galera cluster The Galera cluster is started on the TARGET_SERVERS nodes beginning with the last instance stopped. For example, the nodes are started in the following order: dbs01, dbs02, and dbs03.
Upgrade Galera to v5.7 manually

Note

This feature is available starting from the MCP 2019.2.18 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to manually upgrade the Galera cluster to version 5.7. The upgrade of an underlying operating system is out of scope.

During the upgrade, the Galera cluster is shut down: all nodes are stopped one by one, after which the packages are updated and the services are started again.

Warning

Before performing the upgrade on a production environment:

  • Accomplish the procedure on a staging environment to determine the required maintenance window duration.
  • Schedule an appropriate maintenance window to reduce the load on the cluster.
  • Do not shut down the VMs or workloads as networking and storage functions are not affected.

To upgrade Galera to v5.7 manually:

  1. Prepare the Galera cluster for the upgrade:

    1. Verify that you have added the required repositories on the Galera nodes to download the updated MySQL and Galera packages.

    2. Verify that your Galera cluster is up and running as described in Verify a Galera cluster status.

    3. Create an instant backup of the MySQL database as described in Back up and restore a MySQL database.

    4. Perform a MySQL dump to a file that can later be used as a data source for manual restoration. Verify that the node has enough free space to make a dump file. Run the following command on a database server node (dbs):

      mysqldump --defaults-file=/etc/mysql/debian.cnf -AR > %dump_file
      
  2. Log in to the Salt Master node.

  3. Open the cluster level of your deployment model.

  4. Specify the new version for Galera packages:

    1. In <cluster_name>/openstack/database/mysql_version.yml, create a new YAML file with the following content:

      parameters:
        _param:
          galera_mysql_version: "5.7"
      
    2. In <cluster_name>/openstack/database/master.yml and <cluster_name>/openstack/database/slave.yml, add the created file:

      classes:
      ...
      - cluster.<cluster_name>.openstack.database.mysql_version
      
    3. Refresh pillars on the database nodes:

      salt -C "I@galera:master or I@galera:slave" saltutil.refresh_pillar
      
    4. Verify that the pillars of the database nodes have Galera version 5.7:

      salt -C "I@galera:master or I@galera:slave" pillar.get galera:version:mysql
      

      Warning

      If the Galera version is not 5.7, resolve the issue before proceeding with the upgrade.

  5. Add repositories with new Galera packages:

    1. Apply the linux.system.repo state on the database nodes:

      salt -C "I@galera:master or I@galera:slave" state.sls linux.system.repo
      
    2. Verify the availability of the new MySQL packages:

      salt -C "I@galera:master or I@galera:slave" cmd.run 'apt-cache policy mysql-wsrep-server-5.7 mysql-wsrep-5.7'
      
    3. Verify the availability of the new percona-xtrabackup-24 packages:

      salt -C "I@galera:master or I@galera:slave" cmd.run 'apt-cache policy percona-xtrabackup-24'
      
    4. Verify that the salt-formula-galera version is 1.0+202202111257.6945afc or later:

      dpkg -l |grep salt-formula-galera
      
  6. Verify the runtime versions of the MySQL nodes of the Galera cluster:

    salt -C "I@galera:master or I@galera:slave" mysql.version
    

    Example of system response:

    dbs02.openstack-ovs-core-ssl-pike-8602.local:
        5.6.51-1~u16.04+mcp1
    dbs01.openstack-ovs-core-ssl-pike-8602.local:
        5.6.51-1~u16.04+mcp1
    dbs03.openstack-ovs-core-ssl-pike-8602.local:
        5.6.51-1~u16.04+mcp1
    
  7. Perform the following steps on the Galera nodes one by one:

    1. Stop the MySQL service on the node 3:

      salt -C 'I@galera:slave and *03*' service.stop mysql
      
    2. Stop the MySQL service on the node 2:

      salt -C 'I@galera:slave and *02*' service.stop mysql
      
    3. Stop the MySQL service on the master MySQL node:

      salt -C 'I@galera:master and *01*' service.stop mysql
      
  8. Perform the following steps on the MySQL master node that was stopped last:

    1. Open /etc/mysql/my.cnf for editing.

    2. Comment out the wsrep_cluster_address line:

      ...
      #wsrep_cluster_address="gcomm://192.168.2.51,192.168.2.52,192.168.2.53"
      ...
      
    3. Add the wsrep_cluster_address parameter without any IP address specified:

      ...
      wsrep_cluster_address="gcomm://"
      ...
      
  9. On the same node, upgrade the packages:

    1. Obtain the MySQL root password:

      salt -C "I@galera:master" config.get mysql:client:server:database:admin:password
      
    2. Update the percona-xtrabackup package:

      apt-get -y install percona-xtrabackup-24
      
    3. Update the Galera packages:

      apt-get -y install -o DPkg::Options::=--force-confold -o Dpkg::Options::=--force-confdef mysql-wsrep-5.7 mysql-wsrep-common-5.7 galera-3
      
    4. In the window that opens, enter the root password obtained in the step above.

    5. Restart the mysql service:

      systemctl restart mysql
      
    6. Verify the cluster status:

      salt-call mysql.status | grep -A1 wsrep_cluster_size
      
  10. Perform the step 9 on the remaining Galera nodes one by one.

  11. On the node where you have changed the wsrep_cluster_address parameter, apply the galera state and restart the service:

    salt -C "I@galera:master" state.apply galera
    salt -C "I@galera:master" service.restart mysql
    
  12. Verify that your Galera cluster is up and running:

    salt -C 'I@galera:master' mysql.status | \
    grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
    

    Example of system response:

    wsrep_cluster_size:
      3
    
    wsrep_incoming_addresses:
      192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
    
    wsrep_local_state_comment:
      Synced
    

Upgrade RabbitMQ

This section instructs you on how to upgrade the RabbitMQ component automatically through Jenkins using the Deploy - upgrade RabbitMQ server pipeline job.

Note

  • This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.
  • Starting from the MCP 2019.2.10 maintenance update, RabbitMQ version 3.8.2 is available. Prior to updating RabbitMQ to the new version, perform the steps described in Apply maintenance updates and the following prerequisite steps.
  • For OpenStack Pike environments with RabbitMQ version 3.8.2, the timeouts are set to rabbit_retry_interval = 5, rabbit_retry_backoff = 10, and kombu_reconnect_delay = 5.0 by default in the oslo.messaging package.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

Prerequisites to upgrade RabbitMQ to version 3.8.2:

  1. Log in to the Salt Master node.

  2. Verify that the salt-formula-rabbitmq version is 0.2+202006030849.6079bf0~xenial1 or newer:

    dpkg -l |grep salt-formula-rabbitmq
    
  3. Verify that the salt-formula-oslo-templates version is 2018.1+202006191406.b3839c0~xenial1 or newer:

    dpkg -l |grep salt-formula-oslo-templates
    
  4. Perform the following steps depending on your OpenStack version:

    • For OpenStack Queens, verify that the following OpenStack configuration variables are not lower than specified:

      rabbit_retry_interval = 5
      rabbit_retry_backoff = 10
      kombu_reconnect_delay = 5.0
      

      For example, to verify the value of rabbit_retry_backoff across all nodes:

      salt '*' cmd.run 'grep -r rabbit_retry_backoff /etc'
      
    • For OpenStack Pike, verify that the python-oslo.messaging package version is 5.30.8-1~u16.04+mcp19 or newer:

      salt '*' pkg.version 'python-oslo.messaging'
      
  5. Verify that the RabbitMQ version 3.8.2 is available on the msg nodes:

    salt -C 'I@rabbitmq:server' cmd.run 'apt-cache policy rabbitmq-server'
    

To upgrade RabbitMQ:

  1. Prepare the Neutron server for the RabbitMQ upgrade:

    Caution

    This step is required since the Neutron service is sensitive to the RabbitMQ stability. The following Neutron configuration prevents massive resource rescheduling that can lead to the unbalanced load on the gateway nodes and other undesired consequences.

    1. On each OpenStack controller node, modify the neutron.conf file as follows:

      allow_automatic_dhcp_failover = false
      allow_automatic_l3agent_failover = false
      
    2. Restart the neutron-server service:

      service neutron-server restart
      
  2. For large clusters with more than 50 nodes and more than 100 Open vSwitch ports per node, stop the Neutron Open vSwitch agents on each gateway and compute node to prevent overloading the Neutron servers with massive agent resyncs (a Salt-based variant is sketched after the command below).

    service neutron-openvswitch-agent stop
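
    A Salt-based variant of this command, run from the Salt Master node and assuming the standard neutron:gateway and neutron:compute pillar keys of the MCP Neutron formula, may look as follows:

    # Stop the Neutron Open vSwitch agents on all gateway and compute nodes
    salt -C 'I@neutron:gateway or I@neutron:compute' service.stop neutron-openvswitch-agent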
    
  3. Log in to the Salt Master node.

  4. Open the cluster level of your deployment model.

  5. Include the RabbitMQ upgrade pipeline job to DriveTrain:

    1. Add the following class to classes/cluster/<cluster_name>/cicd/control/leader.yml:

      classes:
      - system.jenkins.client.job.deploy.update.upgrade_rabbitmq
      
    2. Apply the jenkins.client state on the Jenkins nodes:

      salt -C 'I@jenkins:client' state.sls jenkins.client
      
    3. Set the parameters in classes/cluster/<cluster_name>/infra/init.yml as follows:

      parameters:
        _param:
          openstack_upgrade_enabled: true
      
    4. Refresh pillars on the msg* nodes:

      salt 'msg*' saltutil.refresh_pillar
      
  6. Apply the linux.system.repo state on the msg* nodes to add repositories with new RabbitMQ packages:

    salt 'msg*' state.sls linux.system.repo
    
  7. Log in to the Jenkins web UI.

  8. Open the Deploy - upgrade RabbitMQ server pipeline.

  9. Specify the following parameters as required:

    Deploy - upgrade RabbitMQ server pipeline parameters
    Parameter Description
    INTERACTIVE Specifies the mode to get the detailed description of the pipeline job flow through the stages.
    OS_DIST_UPGRADE Optional. Upgrades system packages including kernel using apt-get dist-upgrade.
    OS_UPGRADE Optional. Upgrades all installed applications using apt-get upgrade.
    SALT_MASTER_CREDENTIALS Defines the Salt Master node credentials to use for connection, defaults to salt.
    SALT_MASTER_URL Defines the Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    TARGET_SERVERS Adds the target RabbitMQ nodes. Defaults to msg*.
  10. Click Deploy.

    To monitor the deployment process, follow the instructions in MCP Deployment Guide: View the deployment details.

    The pipeline Deploy - upgrade RabbitMQ server workflow
    Stage Description
    Pre-upgrade Only non-destructive actions are applied during this phase. Basic service verification is performed. The job is launched on all target servers before moving to the next stage.
    Stop RabbitMQ service

    All RabbitMQ services are stopped on the TARGET_SERVERS nodes in the reverse order. For instance, msg03, msg02, and then msg01.

    OpenStack APIs are not accessible starting from this point. This does not affect the data plane services such as OVS or KVM.

    Upgrade RabbitMQ RabbitMQ and Erlang code are upgraded. No workload downtime is expected.
    Upgrade OS Optional. Launches only if OS_UPGRADE or OS_DIST_UPGRADE is selected. A reboot can be performed if required. When the node is back online, the basic service checks are performed.
    Start RabbitMQ service All RabbitMQ services are started on the TARGET_SERVERS nodes beginning with the last instance stopped. For example, msg01, msg02, and then msg03.
  11. Starting from the 2019.2.10 update, apply the fluentd state to the msg nodes:

    salt -C 'I@rabbitmq:server' state.apply fluentd
    
  12. Revert the changes in the neutron.conf file made during step 1.

  13. Gradually start the Neutron agents if you have stopped them during step 2:

    service neutron-dhcp-agent start
    service neutron-l3-agent start
    service neutron-metadata-agent start
    service neutron-openvswitch-agent start
    
  14. Revert the changes in the classes/cluster/<cluster_name>/infra/init.yml file made during step 5.3.

Update or upgrade Kubernetes

Caution

Before proceeding with the upgrade procedure, verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.

This section describes how to automatically upgrade Kubernetes to a major version or update to a minor version. During this procedure, the downtime occurs only while the kube-apiserver VIP of a node to be updated is moving to another active node. The MCP cluster workload is not affected.

Caution

During the execution of the Kubernetes upgrade pipeline job:

  • The Kubernetes Docker container back end is replaced by containerd and all Kubernetes workloads are moved to containerd. However, the Docker service is not stopped or removed in case some third-party Docker workloads that are not related to the MCP Kubernetes cluster are running in Docker. If this is not the case and you do not need Docker anymore, you can disable it after the upgrade.

    Starting from the MCP 2019.2.3 maintenance update, due to the conflict between the docker-engine and the containerd runc packages versions, Docker is removed during the upgrade to prevent the conflict. Therefore, Mirantis recommends migrating the third-party Docker workloads running on the MCP Kubernetes cluster, if any, before the upgrade.

  • If any DaemonSet was deployed on your cluster, all Kubernetes nodes will be rebooted to reflect the containerd changes to DaemonSets.

  • No third-party Docker workloads will be removed from Docker but any Docker workload will be stopped during a node reboot.

    Starting from the MCP 2019.2.3 maintenance update, any third-party Docker workload running on the MCP Kubernetes cluster is stopped and removed along with Docker during a node reboot.

Automatically update or upgrade Kubernetes

This section describes how to update or upgrade your Kubernetes cluster including Calico and etcd [0] using the Jenkins Deploy - update Kubernetes cluster pipeline.

To update or upgrade Kubernetes using the Jenkins pipeline:

  1. Log in to the Jenkins web UI.

  2. Open the Deploy - update Kubernetes cluster pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    ARTIFACTORY_URL Optional. Required for conformance tests only. The artifactory URL where Docker images are located. Automatically taken from Reclass.
    CALICO_UPGRADE_VERSION Define the version of the calico-upgrade utility to use during the Calico upgrade. This option is only relevant if UPGRADE_CALICO_V2_TO_V3 is selected.
    CMP_TARGET Add the target Kubernetes cmp nodes. For example, 'cmp* and I@kubernetes:pool'.
    CONFORMANCE_RUN_AFTER Optional. Select to run the Kubernetes conformance tests after you upgrade a Kubernetes cluster.
    CONFORMANCE_RUN_BEFORE Optional. Select to run the Kubernetes conformance tests before you upgrade a Kubernetes cluster.
    CTL_TARGET Add the target Kubernetes ctl nodes. For example, I@kubernetes:master.
    • KUBERNETES_CALICO_CALICOCTL_SOURCE
    • KUBERNETES_CALICO_CALICOCTL_SOURCE_HASH
    • KUBERNETES_CALICO_CNI_SOURCE_HASH
    • KUBERNETES_CALICO_BIRDCL_SOURCE
    • KUBERNETES_CALICO_BIRDCL_SOURCE_HASH
    • KUBERNETES_CALICO_CNI_IPAM_SOURCE
    • KUBERNETES_CALICO_CNI_IPAM_SOURCE_HASH
    Leave these fields empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version.
    KUBERNETES_CALICO_CNI_SOURCE Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version. For testing purposes, you can add a versioned calico/cni image to use in your deployments.
    KUBERNETES_CALICO_IMAGE Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version. For testing purposes, you can add a versioned calico/node image to use in your deployments.
    KUBERNETES_CALICO_KUBE_CONTROLLERS_IMAGE Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version. For testing purposes, you can add a versioned calico/kube-controllers image to use in your deployments.
    KUBERNETES_ETCD_SOURCE [0]

    Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version.

    For testing purposes, you can add a versioned binary of the etcd server to use in your deployment.

    KUBERNETES_ETCD_SOURCE_HASH [0]

    Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version.

    For testing purposes, you can add the checksum of the versioned etcd binary set in the KUBERNETES_ETCD_SOURCE field.

    KUBERNETES_HYPERKUBE_SOURCE

    Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version.

    For testing purposes, you can add a versioned image to update the Kubernetes Master nodes from. Also, verify that the - cluster.overrides class exists in your cluster model. Otherwise, the KUBERNETES_HYPERKUBE_SOURCE, KUBERNETES_PAUSE_IMAGE, KUBERNETES_CALICO_IMAGE, KUBERNETES_CALICO_CALICOCTL_SOURCE, KUBERNETES_CALICO_CNI_SOURCE, and KUBERNETES_CALICO_KUBE_CONTROLLERS_IMAGE variables will not take effect.

    KUBERNETES_HYPERKUBE_SOURCE_HASH Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version.
    KUBERNETES_PAUSE_IMAGE Leave this field empty since you have already updated your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version. For testing purposes, you can add a versioned image to use in your deployments.
    PER_NODE Select to update or upgrade the target Kubernetes nodes one by one. The option is required for the draining and cordoning functionality. Recommended.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    SIMPLE_UPGRADE Select if you do not need to drain and cordon the nodes during the cluster upgrade. Not recommended.
    TARGET_UPDATES Add the comma-separated list of the Kubernetes nodes to update. Valid values are ctl, cmp. To update only the Kubernetes Nodes, for example, define cmp only.
    TEST_K8S_API_SERVER Optional. Required for conformance tests only. The IP address of a local Kubernetes API server for conformance tests.
    UPGRADE_CALICO_V2_TO_V3

    Select to upgrade Calico from version 2.6.x to the latest supported version (3.x). This option is required only for upgrade of Calico from version 2.6.x. To update the minor Calico version (for example, from 3.1.x to 3.3.x), do not select this option, since the regular Kubernetes update procedure already includes the update of Calico minor version.

    Caution

    The Calico upgrade process implies Kubernetes services downtime for workload operations, for example, spawning and removing workloads. The downtime is caused by the required migration of the etcd schema where the Calico endpoints data and other Calico configuration data is stored.

  4. Click Deploy. To monitor the deployment process, follow the instruction in MCP Deployment Guide: View the deployment details.

  5. Obsolete since 2019.2.3 If you do not have any third-party Docker workloads that run outside the MCP Kubernetes cluster, stop and disable Docker on reboot. Run the following commands on any Kubernetes Master node:

    systemctl stop docker
    systemctl disable docker
    

    Note

    If your MCP cluster version is 2019.2.3 or later, skip this step since Docker is removed during the upgrade.

The Deploy - update Kubernetes cluster pipeline workflow:

Note

While any Kubernetes node is being cordoned or drained, other cluster nodes run its services. Therefore, the workload is not affected.

  1. Add the hyperkube images defined in the KUBERNETES_HYPERKUBE_SOURCE, KUBERNETES_HYPERKUBE_SOURCE_HASH, and KUBERNETES_PAUSE_IMAGE pipeline fields or defined on the reclass-system level of the Reclass model for the target Kubernetes ctl and cmp nodes during the MCP release version upgrade.
  2. Add the calico images defined in the KUBERNETES_CALICO_IMAGE, KUBERNETES_CALICO_CALICOCTL_SOURCE, KUBERNETES_CALICO_CALICOCTL_SOURCE_HASH, KUBERNETES_CALICO_CNI_SOURCE, KUBERNETES_CALICO_CNI_SOURCE_HASH, KUBERNETES_CALICO_BIRDCL_SOURCE, KUBERNETES_CALICO_BIRDCL_SOURCE_HASH, KUBERNETES_CALICO_CNI_IPAM_SOURCE, KUBERNETES_CALICO_CNI_IPAM_SOURCE_HASH, KUBERNETES_CALICO_KUBE_CONTROLLERS_IMAGE pipeline job parameters or defined on the reclass-system level of the Reclass model for the target Kubernetes ctl and cmp nodes during the MCP release version upgrade.
  3. If the UPGRADE_CALICO_V2_TO_V3 option is selected, perform the Calico upgrade:
    1. Download the calico-upgrade utility according to the CALICO_UPGRADE_VERSION setting.
    2. Verify that the Calico upgrade is possible (verify the current Calico version and perform a dry run of the data upgrade).
    3. Verify the Calico policy setting.
    4. Perform the Calico data upgrade and lock Calico.
    5. Update the Calico configuration and restart corresponding services. Calico is not operating during this step, so a cluster experiences the workloads operations downtime.
    6. Unlock Calico.
  4. Upgrade etcd on the Kubernetes target ctl nodes one by one.
  5. For the first Kubernetes target ctl node:
    1. Cordon the node.
    2. Drain the node.
    3. Regenerate SSL certificates for the node services.
    4. Start the containerd service.
    5. Upgrade the hyperkube image.
    6. Restart the node services (api-server, api-scheduler, controller-manager, kube-proxy).
    7. Restart the kubelet service.
    8. If any DaemonSet is deployed on a cluster:
      1. Force remove the DaemonSet pod.
      2. Reboot the node.
    9. Uncordon the node.
  6. Complete the previous step on the remaining target ctl nodes one by one.
  7. Upgrade add-ons if any on the target ctl nodes.
  8. For the first Kubernetes target cmp node:
    1. Cordon the node.
    2. Drain the node.
    3. Regenerate SSL certificates for the node services.
    4. Start the containerd service.
    5. Upgrade the hyperkube image.
    6. Restart the kubelet service.
    7. If any DaemonSet is deployed on a cluster:
      1. Force remove the DaemonSet pod.
      2. Reboot the node.
    8. Uncordon the node.
  9. Complete the previous step on the remaining target cmp nodes one by one.
  10. Verify the Calico cluster version and integrity.
The etcd upgrade is added since the MCP 2019.2.3 maintenance update.

Manually upgrade Calico from version 2.6 to 3.3

This section describes the manual Calico upgrade procedure from major version 2.6 to 3.3. To simplify the upgrade process, use the automatic Calico upgrade procedure that is included into the Kubernetes upgrade pipeline. For details, see: Automatically update or upgrade Kubernetes.

Note

To update the minor Calico version (for example, from 3.1.x to 3.3.x), use the regular Kubernetes update procedure described in Update or upgrade Kubernetes.

The upgrade process causes about 1-2 minutes of downtime for the Calico-related services on a virtual 5-node cluster. The downtime may vary depending on the hardware and cluster configuration.

Caution

This upgrade procedure is applicable when MCP is upgraded from Build ID 2018.8.0 to a newer MCP release version.

MCP does not support the Calico upgrade path for the MCP Build IDs earlier than 2018.8.0.

The Kubernetes services downtime for workload operations is caused by the required migration of the etcd schema in which the Calico endpoints data and other configuration data are stored. Also, the calico-node and calico-kube-controllers components must have the same version to operate properly, so there will be downtime while these components are being restarted.

To upgrade Calico from version 2.6 to 3.3:

  1. Upgrade your MCP cluster to a newer Build ID as described in Upgrade DriveTrain to a newer release version. Once done, the version parameters and configuration files of the Calico components are updated automatically to the latest supported version.

  2. Log in to any Kubernetes ctl node where etcd is running.

  3. Migrate the etcd schema:

    1. Download the Calico upgrade binary file:

      wget https://github.com/projectcalico/calico-upgrade/releases/download/v1.0.5/calico-upgrade
      
    2. Grant execute permissions to the binary file:

      chmod +x ./calico-upgrade
      
    3. Obtain the etcd endpoints:

      salt-call pillar.get etcd:server:members
      
    4. Export the etcd environment variables. For example:

      export APIV1_ETCD_ENDPOINTS=https://10.70.2.101:4001,https://10.70.2.102:4001,https://10.70.2.103:4001
      export APIV1_ETCD_CA_CERT_FILE=/var/lib/etcd/ca.pem
      export APIV1_ETCD_CERT_FILE=/var/lib/etcd/etcd-client.crt
      export APIV1_ETCD_KEY_FILE=/var/lib/etcd/etcd-client.key
      export ETCD_ENDPOINTS=https://10.70.2.101:4001,https://10.70.2.102:4001,https://10.70.2.103:4001
      export ETCD_CA_CERT_FILE=/var/lib/etcd/ca.pem
      export ETCD_CERT_FILE=/var/lib/etcd/etcd-client.crt
      export ETCD_KEY_FILE=/var/lib/etcd/etcd-client.key
      

      Substitute APIV1_ETCD_ENDPOINTS and ETCD_ENDPOINTS with the values that correspond to your environment.

    5. Start the Calico upgrade:

      ./calico-upgrade start --no-prompts
      

      Note

      After executing this command, Calico pauses to avoid running into an incorrect data state.

  4. Apply the new Calico configuration:

    1. Log in to the Salt Master node.

    2. Update basic Calico components:

      salt -C 'I@kubernetes:pool' state.sls kubernetes.pool
      
    3. Log in to the ctl node on which you started the Calico upgrade.

    4. Resume Calico after the etcd schema migration. For example:

      export APIV1_ETCD_ENDPOINTS=https://10.70.2.101:4001,https://10.70.2.102:4001,https://10.70.2.103:4001
      export APIV1_ETCD_CA_CERT_FILE=/var/lib/etcd/ca.pem
      export APIV1_ETCD_CERT_FILE=/var/lib/etcd/etcd-client.crt
      export APIV1_ETCD_KEY_FILE=/var/lib/etcd/etcd-client.key
      export ETCD_ENDPOINTS=https://10.70.2.101:4001,https://10.70.2.102:4001,https://10.70.2.103:4001
      export ETCD_CA_CERT_FILE=/var/lib/etcd/ca.pem
      export ETCD_CERT_FILE=/var/lib/etcd/etcd-client.crt
      export ETCD_KEY_FILE=/var/lib/etcd/etcd-client.key
      ./calico-upgrade complete --no-prompts
      
    5. Log in to the Salt Master node.

    6. Update the Kubernetes add-ons:

      salt -C 'I@kubernetes:master' state.sls kubernetes.master.kube-addons
      salt -C 'I@kubernetes:master' state.sls kubernetes exclude=kubernetes.master.setup
      salt -C 'I@kubernetes:master' --subset 1 state.sls kubernetes.master.setup
      
    7. Restart kubelet:

      salt -C 'I@kubernetes:pool' service.restart kubelet
      
  5. Log in to any ctl node.

  6. Verify the Kubernetes cluster consistency:

    1. Verify the Calico version and Calico cluster consistency:

      calicoctl version
      calicoctl node status
      calicoctl get ipPool
      
    2. Verify that the Kubernetes objects are healthy and consistent:

      kubectl get node -o wide
      kubectl get pod -o wide --all-namespaces
      kubectl get ep -o wide --all-namespaces
      kubectl get svc -o wide --all-namespaces
      
    3. Verify the connectivity using Netchecker:

      kubectl get ep netchecker -n netchecker
      curl {{netchecker_endpoint}}/api/v1/connectivity_check
      

See also

Upgrade StackLight LMA to Build ID 2019.2.0

This section describes how to upgrade your Prometheus-based StackLight LMA from MCP Build ID 2018.11.0 to 2019.2.0. StackLight LMA does not have its own versioning schema within MCP and is versioned by MCP releases. To upgrade from older versions, see: MCP Release Compatibility Matrix: Supported upgrade paths.

Warning

During the upgrade, the existing monitoring services may be disrupted. Therefore, you must plan a maintenance window.

Prerequisites

Before you proceed with upgrading StackLight LMA, perform the following prerequisite steps:

  1. Upgrade your MCP cluster as described in Upgrade DriveTrain to a newer release version.

    Note

    If you are performing a maintenance update, verify that you have updated your MCP cluster as described in Update DriveTrain instead.

  2. Log in to the Salt Master node.

  3. Verify that there are no uncommitted changes on the cluster and system levels of the Reclass model:

    1. For the cluster level:

      cd /srv/salt/reclass/classes/cluster/<cluster_name>/
      git status
      
    2. For the system level:

      cd /srv/salt/reclass/classes/system
      git status
      

    Example of system response:

    On branch <branch_name>
    Your branch is up-to-date with 'origin/<branch_name>'.
    
    nothing to commit, working tree clean
    

    In case of any uncommitted changes, solve the issue before proceeding with the upgrade.

  4. Verify that all Salt minions are available:

    salt '*' test.ping
    

    Example of system response:

    cmp0.<cluster_name>:
        True
    mon01.<cluster_name>:
        True
    gtw01.<cluster_name>:
        True
    mon02.<cluster_name>:
        True
    prx01.<cluster_name>:
        True
    cmp1.<cluster_name>:
        True
    mon03.<cluster_name>:
        True
    cfg01.<cluster_name>:
        True
    

    If any VM is not available, do not proceed before resolving the issue.

  5. Verify that no alerts are triggered in the Alertmanager web UI as described in Use the Alertmanager web UI.

    Caution

    If any alert is triggered, investigate the issue before proceeding with the upgrade. An alert may have a low severity and it is up to the system administrator to decide whether to proceed with the upgrade in this case.

  6. Specify the elasticsearch_version: 6 and kibana_version: 6 parameters in the classes/cluster/<cluster_name>/stacklight/log.yml file of your Reclass model to upgrade Elasticsearch and Kibana from v5 to v6, as shown in the example below. Otherwise, Elasticsearch and Kibana will be upgraded to the latest minor stable versions.
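
    For example, a minimal snippet, assuming that these parameters are defined under _param in log.yml (adjust the location to your model layout):

      parameters:
        _param:
          elasticsearch_version: 6
          kibana_version: 6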

Once done, proceed to Upgrade StackLight LMA using the Jenkins job.

Upgrade StackLight LMA using the Jenkins job

This section describes how to upgrade StackLight LMA using the Deploy - upgrade Stacklight Jenkins job.

Warning

Verify that you have completed the steps described in Prerequisites.

To upgrade StackLight LMA using the Jenkins job:

  1. Log in to the Jenkins web UI.

  2. Open the Deploy - upgrade Stacklight pipeline job.

  3. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_URL The URL of Salt API.
    SALT_MASTER_CREDENTIALS Credentials for Salt API stored in Jenkins.
    STAGE_UPGRADE_SYSTEM_PART Select to upgrade the system part including Telegraf, Fluentd, and Prometheus Relay.
    STAGE_UPGRADE_ES_KIBANA Select to upgrade Elasticsearch and Kibana.
    STAGE_UPGRADE_DOCKER_COMPONENTS Select to upgrade the StackLight LMA components running in Docker Swarm.
  4. Click Build.

  5. Click Full stage view to track the upgrade process.

    The following table contains the details of the upgrade stages:

    The upgrade pipeline workflow
    # Stage Details
    1 Update grains and mines
    1. Refreshes grains on all nodes by applying the salt.minion.grains state.
    2. Refreshes modules on all nodes by running the saltutil.refresh_modules command.
    3. Updates mines on all nodes by running the mine.update command.
    2 Enable the Ceph Prometheus plugin If Ceph is installed in the cluster, enables the Ceph Prometheus plugin by applying the ceph.mgr state on the Ceph Monitor nodes.
    3 Upgrade system components

    For each service including Telegraf, Fluentd, Prometheus Relay, libvirt-exporter, and jmx-exporter:

    1. Prepares the nodes that have the component installed: refreshes pillars and applies the linux.system.repo state.
    2. Updates the packages to the latest versions.
    3. Applies the corresponding Salt states for the changes to take effect.
    4. Shows the statuses of the services after the upgrade.
    4 Upgrade Elasticsearch and Kibana
    1. For Elasticsearch:
      1. Stops the service on all log nodes.
      2. Upgrades the package to the newest version.
      3. Reloads the systemd configuration.
      4. Starts the service on all log nodes.
      5. Verifies that the Elasticsearch cluster status is green.
      6. In case of a major upgrade, transforms the indices for the new version.
    2. For Kibana:
      1. Stops the service on all log nodes.
      2. Upgrades the package to the newest version.
      3. Starts the service on all log nodes.
      4. Shows the status of the service after the upgrade.
      5. In case of a major upgrade, migrates Kibana to the new index.
    5 Upgrade components running in Docker Swarm
    1. Disables and removes the previous versions of monitoring services.
    2. Rebuilds the Prometheus configuration by applying the prometheus state on the mon nodes.
    3. Disables and removes the previous version of Grafana.
    4. Starts the monitoring services by applying the docker state on the mon nodes.
    5. Synchronizes Salt modules by applying the saltutil.sync_all state.
    6. Applies the grafana.client state to refresh the Grafana dashboards.

Note

You may also enable additional functionality, such as Alerta or the Gainsight integration as required. For details, see Add new features to an existing StackLight LMA deployment.

Once done, proceed to Verify StackLight LMA after upgrade.

Verify StackLight LMA after upgrade

Once you have upgraded StackLight LMA as described in Upgrade StackLight LMA to Build ID 2019.2.0, verify that the upgrade succeeded and all StackLight LMA components are up and running.

To verify StackLight LMA after upgrade:

  1. Inspect the execution logs of the Salt states for any errors.
  2. Verify the availability of the Prometheus, Prometheus long-term storage, Grafana, Kibana, and Alertmanager web UIs as described in the corresponding sections of StackLight LMA operations. If you have deployed Alerta, also verify the Alerta web UI.
  3. Verify the Prometheus targets in the Prometheus web UI.
  4. Verify the Alertmanager alerts in the Alertmanager web UI.
  5. Verify the data availability:
    1. Inspect the Main and Prometheus performances Grafana dashboards to verify that the data sources operate and metrics are available.
    2. Inspect the dashboards in the Kibana web UI for occurrences of the No results found error message.
    3. In the Kibana web UI, filter and inspect the error messages, if any.

Upgrade Ceph

You can upgrade your existing Ceph cluster from Jewel to Luminous and from Luminous to Nautilus and roll back Ceph VMs and OSDs if the upgrade fails.

Warning

The upgrade of Ceph Luminous to Nautilus is supported starting from the 2019.2.10 maintenance update. If your Ceph version is Jewel, first upgrade to Ceph Luminous as described below.

Upgrade the Ceph cluster

This section describes how to upgrade an existing Ceph cluster from Jewel to Luminous and from Luminous to Nautilus. If your Ceph version is Jewel, you must first upgrade to Luminous before upgrading to Ceph Nautilus. The Ceph - upgrade pipeline contains several stages and runs on a node-by-node basis: each node is upgraded separately, and the pipeline requires user input to confirm that the Ceph cluster status is correct and that the upgrade of the node was successful. In case of a failure, you can immediately roll back each node.

Note

The following procedure provides for the Ceph upgrade to a major version. To update the Ceph packages to the latest minor versions, follow Update Ceph.

Warning

Before you upgrade Ceph:

  1. If Ceph is being upgraded as part of the MCP upgrade, verify that you have upgraded your MCP cluster as described in Upgrade DriveTrain to a newer release version.
  2. Verify that you have configured the server and client roles for a Ceph backup as described in Create a backup schedule for Ceph nodes.
  3. The upgrade of Ceph Luminous to Nautilus is supported starting from the 2019.2.10 maintenance update. Verify that you have performed the following steps:
    1. Apply maintenance updates.
    2. Enable the ceph-volume tool.
  4. If you are upgrading Ceph from a version prior to 14.2.20, verify that the ceph:common:config:mon:auth_allow_insecure_global_id_reclaim pillar is unset or set to true.
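
    For example, to check the current value, run the following command from the Salt Master node (a minimal sketch; an empty or True result means that you can proceed):

      salt -C 'I@ceph:common' pillar.get ceph:common:config:mon:auth_allow_insecure_global_id_reclaim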

To upgrade the Ceph cluster:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In ceph/init.yml, specify the ceph_version parameter as required:

    • To upgrade from Jewel to Luminous:

      _param:
        ceph_version: luminous
      
    • To upgrade from Luminous to Nautilus:

      _param:
        ceph_version: nautilus
        linux_system_repo_mcp_ceph_codename: ${_param:ceph_version}
      
  3. Reconfigure package repositories:

    • To upgrade from Jewel to Luminous, in infra/init.yml, specify the linux_system_repo_update_mcp_ceph_url parameter:

      _param:
        linux_system_repo_update_mcp_ceph_url: ${_param:linux_system_repo_update_url}/ceph-luminous/
      
    • To upgrade from Luminous to Nautilus, remove the obsolete package repository by deleting all includes of system.linux.system.repo.mcp.apt_mirantis.ceph from the cluster model.

      The default files that include this class are as follows:

      • ceph/common.yml
      • openstack/init.yml
      • infra/kvm.yml

      Also, remove this class from non-default files of your cluster model, if any.

  4. In ceph/mon.yml, verify that the following line is present:

    classes:
    - system.ceph.mgr.cluster
    
  5. Commit the changes to your local repository:

    git add infra/init.yml
    git add ceph/init.yml
    git add ceph/mon.yml
    git commit -m "updated repositories for Ceph upgrade"
    
  6. Refresh Salt pillars:

    salt '*' saltutil.refresh_pillar
    
  7. Unset all flags and verify that the cluster is healthy.
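
    For example, on a Ceph node with the admin keyring, a minimal sketch assuming that only the noout flag was set (unset only the flags that were actually set):

      ceph osd dump | grep flags
      ceph osd unset noout
      ceph -s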

    Note

    Proceeding with some flags set on the cluster may cause unexpected errors.

  8. Log in to the Jenkins web UI.

  9. Open the Ceph - upgrade pipeline.

  10. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    ADMIN_HOST Add cmn01* as the Ceph cluster node with the admin keyring.
    CLUSTER_FLAGS

    Add a comma-separated list of flags to apply before and after the pipeline:

    • The sortbitwise,noout flags are mandatory for the upgrade of Ceph Jewel to Luminous.
    • The noout flag is mandatory for the upgrade of Ceph Luminous to Nautilus.
    • Specify the flags that you unset in step 7. The flags will be automatically unset at the end of the pipeline execution.
    WAIT_FOR_HEALTHY Verify that this parameter is selected as it enables the Ceph health check within the pipeline.
    ORIGIN_RELEASE Add the current Ceph release version.
    TARGET_RELEASE Add the required Ceph release version.
    STAGE_UPGRADE_MON Select to upgrade Ceph mon nodes.
    STAGE_UPGRADE_MGR Select to deploy new mgr services or upgrade the existing ones.
    STAGE_UPGRADE_OSD Select to upgrade Ceph osd nodes.
    STAGE_UPGRADE_RGW Select to upgrade Ceph rgw nodes.
    STAGE_UPGRADE_CLIENT Select to upgrade Ceph client nodes, such as ctl, cmp, and others.
    STAGE_FINALIZE Select to set the configurations recommended for TARGET_RELEASE as a final step of the upgrade.
    BACKUP_ENABLED

    Select to copy the disks of Ceph VMs before upgrade and to back up Ceph directories on OSD nodes.

    Note

    During the backup, the virtual machines are backed up sequentially, one after another: each VM is destroyed, its disk is copied, and then the VM is started again. After a VM launches, the backup procedure pauses until the VM joins the Ceph cluster again, and only then continues with the next node. On the OSD nodes, only the /etc/ceph and /var/lib/ceph/ directories are backed up. Mirantis recommends verifying that each OSD has been successfully upgraded before proceeding to the next one.

    BACKUP_DIR Added since 2019.2.4 update Optional. If BACKUP_ENABLED is selected, specify the target directory for the backup.
  11. Click Deploy.

    Warning

    If the upgrade on the first node fails, stop the upgrade procedure and roll back the failed node as described in Roll back Ceph services.

  12. In case of the Ceph Luminous to Nautilus upgrade, run the Deploy - upgrade Stacklight Jenkins pipeline job as described in Upgrade StackLight LMA using the Jenkins job to reconfigure StackLight for the new Ceph metrics.

Caution

The Jenkins pipeline job changes the repositories, upgrades the packages, and restarts the services on each node of the selected groups. Since the pipeline installs packages from the configured repositories and does not verify the version to install, in some environments the version of the Ceph packages can differ between the nodes after the upgrade. The mismatch can affect a single node, due to a manual configuration made without the model, or an entire group, such as the Ceph OSD nodes, due to a misconfiguration in Reclass. It is possible to run a cluster with mismatching versions. However, such a configuration is not supported and, with specific versions, may cause a cluster outage.

The Jenkins pipeline job reports the package version changes in the console output for each node. Review these changes before proceeding to the next node, especially on the first node of each component.

The Ceph - upgrade pipeline workflow:

  1. Perform the backup.
  2. Set upgrade flags.
  3. Perform the following steps for each selected stage for each node separately:
    1. Update Ceph repository.
    2. Upgrade Ceph packages.
    3. Restart Ceph services.
    4. Execute the verification command.
    5. Wait for user input to proceed.
  4. Unset the upgrade flags.
  5. Set ceph osd require-osd-release as TARGET_RELEASE.
  6. Set ceph osd set-require-min-compat-client as ORIGIN_RELEASE.
  7. Set CRUSH tunables to optimal.
  8. If you enabled scrubbing and deep scrubbing before starting the upgrade, disable them by setting the noscrub and nodeep-scrub flags using the ceph osd set noscrub and ceph osd set nodeep-scrub commands. Also, remove the scrubbing settings, if any.
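
For reference, steps 5 through 7 of the workflow correspond to commands similar to the following. This is only an illustration for a Luminous to Nautilus upgrade; the pipeline substitutes the ORIGIN_RELEASE and TARGET_RELEASE values:

  ceph osd require-osd-release nautilus
  ceph osd set-require-min-compat-client luminous
  ceph osd crush tunables optimal
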
Roll back Ceph services

You may need to roll back the Ceph components to the previous version if the upgrade procedure fails. To successfully return Ceph to the previous version, manually roll back all Ceph components: the Ceph VMs, which include the Ceph Monitor and Ceph RADOS Gateway nodes, and the Ceph OSDs.

During the upgrade, the backed up Ceph VMs are copied to the root directory of the KVM nodes on which the Ceph nodes reside.

To roll back the Ceph VMs:

  1. Log in to the KVM node.

  2. Copy the backed up disks to the libvirt images directory. For example, to /var/lib/libvirt/images/cmn04.local/system.qcow2.

  3. Start the instance:

    virsh start <NODE_NAME>.<NODE_DOMAIN>
    

To roll back the Ceph OSDs:

  1. Manually install Ceph packages of the previous version:

    apt install <CEPH_PACKAGE_1>=<VERSION> <CEPH_PACKAGE_2>=<VERSION>
    
  2. Restart services for all OSDs:

    systemctl restart ceph-osd@<osd_num>
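
    For example, to restart all Ceph OSD daemons on the node, a minimal sketch assuming the default OSD data path /var/lib/ceph/osd/ceph-<id>:

      for id in $(ls /var/lib/ceph/osd/ | sed 's/^ceph-//'); do
        systemctl restart "ceph-osd@${id}"
      done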
    

If step 1 fails, perform the following steps:

  1. Purge Ceph packages:

    apt purge <CEPH_PACKAGE_1> <CEPH_PACKAGE_2>
    
  2. Manually install Ceph packages of the previous version:

    apt install <CEPH_PACKAGE_1>=<VERSION> <CEPH_PACKAGE_2>=<VERSION>
    
  3. Follow the steps described in Restore the metadata of a Ceph OSD node. Alternatively, if restoring from a backup is not possible:

    1. Purge Ceph packages:

      apt purge <CEPH_PACKAGE_1> <CEPH_PACKAGE_2>
      
    2. Verify that no Ceph devices are mounted and unmount them, if any:

      umount <device_partition_path>
      
    3. Run the following Salt states to redeploy the Ceph OSD node:

      salt -C 'I@ceph:osd' state.sls ceph.osd
      salt -C 'I@ceph:osd' saltutil.sync_grains
      salt -C 'I@ceph:osd' state.sls ceph.osd.custom
      salt -C 'I@ceph:osd' saltutil.sync_grains
      salt -C 'I@ceph:osd' mine.update
      salt -C 'I@ceph:setup' state.sls ceph.setup
      

See also

Update Ceph

Update an MCP cluster

Use the procedures in this section to apply minor updates to specific components of your MCP deployment with granular control that requires manual intervention.

Minor updates enable:

  • Delivering hot fixes to source code of OpenStack, Kubernetes, or MCP Control Plane
  • Updating packages for an OpenStack service
  • Updating the OpenContrail 4.x nodes
  • Applying security patches to operating system components and packages
  • Updating Virtlet to minor versions within the scope of a major release version
  • Updating Ceph packages to latest minor versions
  • Updating StackLight LMA

Note

To update a Kubernetes cluster, refer to Update or upgrade Kubernetes.

The sequence of steps to be completed for the MCP components during a maintenance update is described in the Apply maintenance updates section of the corresponding MCP maintenance update version in MCP Release Notes: Maintenance updates.

Update local mirrors

If you use local mirrors in the MCP cluster, you must update the local mirror VM before updating DriveTrain to a minor release version. Otherwise, skip this section and proceed with Update DriveTrain.

You can update local mirrors either manually or by replacing the existing local mirror VM with the latest version.

To replace the existing local mirror VM:

Warning

This procedure implies recreation of the apt01 VM. Therefore, all existing customizations applied to the local mirror VM will be lost. To keep the existing customizations, use the manual procedure below instead.

  1. Log in to the Salt Master node.

  2. Download the latest version of the prebuilt http://images.mirantis.com/mcp-offline-image-<BUILD-ID>.qcow2 image for the apt01 node from http://images.mirantis.com.

  3. Copy the new image to /var/lib/libvirt/images/apt01/.
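
    For example, a sketch assuming the default libvirt images layout; the target file name is only illustrative, and <BUILD-ID> remains a placeholder for the actual build:

      wget http://images.mirantis.com/mcp-offline-image-<BUILD-ID>.qcow2 \
        -O /var/lib/libvirt/images/apt01/mcp-offline-image-<BUILD-ID>.qcow2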

  4. Power off the existing apt01 VM instance:

    virsh shutdown apt01.<CLUSTER_DOMAIN>
    
  5. If the local mirror VM is connected to the Salt Master node, temporarily remove it from the available minions:

    salt-key
    salt-key -d <local-mirror-node-name>
    

    Note

    The local mirror VM will be automatically connected back to the Salt Master node once the updated VM is deployed.

  6. Deploy the apt01 VM instance with the new image as described in MCP Deployment Guide: Deploy the APT node.

  7. Once the local mirror VM with the new image is up and running, verify that your current Reclass system model has the correct origin set:

    cd /srv/salt/reclass/classes/system
    git remote -v
    

    Optional. You may set origin to your local mirror VM as it now has the latest updated version of the Reclass system model:

    git remote remove origin
    git remote add origin http://<local_mirror_vm_ip>:8088/reclass-system.git
    
  8. For an OpenContrail-based MCP cluster with the Build ID 2019.2.5 or earlier:

    1. Verify whether the MCP cluster uses internal_proxy:

      salt -C 'I@docker:host' pillar.get docker:host:proxy:enabled
      

      Depending on the system output, add the corresponding pillar in the next step.

    2. Add the following pillar to opencontrail/control.yml and opencontrail/analytics.yml:

      • For MCP clusters with internal_proxy enabled:

        parameters:
          ...
          docker:
            host:
              proxy:
                enabled: true
                http: ${_param:http_proxy}
                https: ${_param:http_proxy}
                no_proxy: ${linux:system:proxy:noproxy}
              insecure_registries:
                - ${_param:aptly_server_hostname}:5000
        
      • For MCP clusters with internal_proxy disabled:

        parameters:
          ...
          docker:
            host:
              insecure_registries:
                - ${_param:aptly_server_hostname}:5000
        
    3. Commit the changes to your local repository.

    4. Apply the changes:

      salt -C 'I@opencontrail:database' state.sls docker.host
      

Now, you can proceed with step 2 of the Update DriveTrain procedure.


To update the local mirror VM manually:

Note

The procedure below requires access to the Internet.

  1. Log in to the Salt Master node.

  2. Verify the availability of the local mirror VM. For example:

    salt 'apt01.local-deployment.local' test.ping
    

    If the VM does not respond, enable the management of the offline mirror VM through the Salt Master node as described in MCP Deployment Guide: Enable the APT node management in the Reclass model.

  3. Update the system level of the Reclass model. For example, using the Git submodule:

    Note

    If the Reclass system model has git origin set to the internal mirror repository, update it first.

    cd /srv/salt/reclass/ && git submodule foreach git fetch
    cd /srv/salt/reclass/classes/system && git checkout origin/release/2019.2.0
    
  4. Open the cluster level of your Reclass model.

  5. In /infra/mirror/init.yml:

    • Add the Git server parameters:

      git:
        server:
          directory: /srv/git/
          repos: ${_param:default_local_mirrror_content:git_server_repos}
      
    • Include the debmirror content class:

      - system.debmirror.mirror_mirantis_com
      
    • Include the Docker registry class:

      docker:
        client:
          registry:
            target_registry: ${_param:default_local_mirrror_content:docker_client_registry_target_registry}
            image: ${_param:default_local_mirrror_content:docker_client_registry_image}
      
    • Include the static files class:

      linux:
        system:
          file: ${_param:default_local_mirrror_content:linux_system_file}
      
    • Include the MAAS mirror class:

      maas:
        mirror:
          enabled: true
          image:
            sections: ${_param:default_local_mirrror_content:maas_mirror_image_sections}
      
  6. For an OpenContrail-based MCP cluster with the Build ID 2019.2.5 or earlier:

    1. Verify whether the MCP cluster uses internal_proxy:

      salt -C 'I@docker:host' pillar.get docker:host:proxy:enabled
      

      Depending on the system output, add the corresponding pillar in the next step.

    2. Add the following pillar to opencontrail/control.yml and opencontrail/analytics.yml:

      • For the MCP clusters with internal_proxy enabled:

        parameters:
          ...
          docker:
            host:
              proxy:
                enabled: true
                http: ${_param:http_proxy}
                https: ${_param:http_proxy}
                no_proxy: ${linux:system:proxy:noproxy}
              insecure_registries:
                - ${_param:aptly_server_hostname}:5000
        
      • For the MCP clusters with internal_proxy disabled:

        parameters:
          ...
          docker:
            host:
              insecure_registries:
                - ${_param:aptly_server_hostname}:5000
        
  7. Commit the changes to your local repository.

  8. Synchronize the Salt modules:

    salt '<offline_node_name>' saltutil.sync_all
    
  9. Apply the following states:

    salt '<offline_node_name>' state.sls git.server
    salt '<offline_node_name>' state.sls debmirror
    salt '<offline_node_name>' state.sls docker.client.registry
    salt '<offline_node_name>' state.sls linux.system.file
    salt '<offline_node_name>' state.sls maas.mirror
    
  10. For an OpenContrail-based MCP cluster, apply the changes made in the step 6:

    salt -C 'I@opencontrail:database' state.sls docker.host
    

Now, you can proceed with step 2 of the Update DriveTrain procedure.

Verify DriveTrain

Note

This feature is available starting from the MCP 2019.2.17 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Using the Deploy - pre upgrade verify MCP Drivetrain Jenkins pipeline job, you can verify that your deployment model contains information about the necessary fixes and workarounds. Use the Jenkins pipeline job starting from the 2019.2.18 maintenance update.

To verify the deployment model before update:

  1. Synchronize the upgrade pipeline jobs with the default parameters:

    1. Log in to the Jenkins web UI.
    2. Run the following Jenkins pipeline jobs with default parameters:
      • git-mirror-downstream-mk-pipelines
      • git-mirror-downstream-pipeline-library
  2. Switch mcp-ci/pipeline-library to the version you are planning to update to. In the branch value refs/tags/2019.2.X, substitute X with the desired maintenance update version. For example, if you want to apply maintenance update 2019.2.18, use the following command:

    salt -C "I@jenkins:client:lib" state.sls jenkins.client.lib pillar='{"jenkins": {"client": {"lib" : {"pipeline-library" : {"branch": "refs/tags/2019.2.18"}}}}}'
    
  3. In the global view, find the Deploy - pre upgrade verify MCP Drivetrain pipeline.

  4. Select the Build with Parameters option from the drop-down menu of the pipeline.

  5. Specify the following parameters:

    Deploy - pre upgrade verify MCP Drivetrain parameters
    Parameter Description and values
    MK_PIPELINES_REFSPEC Set to the desired maintenance update version. For example, if your version is 2019.2.17 and you want to verify 2019.2.18, set MK_PIPELINES_REFSPEC to 2019.2.18.
    PIPELINE_TIMEOUT The time for the Jenkins job to complete, set to 1 hour by default. If the timeout is exceeded, the Jenkins job is aborted.
    SALT_MASTER_CREDENTIALS Jenkins credentials to access the Salt Master API.
    SALT_MASTER_URL The URL of the Salt Master node API.
  6. Click Build.

  7. Once the job finishes, open report.html in the job artifacts. Review the report and fix issues, if any.

Update DriveTrain

Within the scope of a major MCP release version, maintenance updates that contain enhancements to the existing features as well as proactive security and critical bug fixes are released as minor release versions. For details, see: MCP Release Notes: Maintenance updates. Before applying maintenance updates to an MCP cluster, the DriveTrain update is required.

To update DriveTrain to a minor release version:

  1. For MCP clusters with local mirrors enabled, Update local mirrors.

  2. If you are applying maintenance update 2019.2.15 or later, synchronize the upgrade pipeline jobs with the default parameters:

    1. Log in to the Jenkins web UI.
    2. Run the following Jenkins pipeline jobs with default parameters:
      • git-mirror-downstream-mk-pipelines
      • git-mirror-downstream-pipeline-library
  3. Switch mcp-ci/pipeline-library to the version you are planning to update to. In the branch value refs/tags/2019.2.X, substitute X with the desired maintenance update version. For example, if you are applying maintenance update 2019.2.17, use the following command:

    salt -C "I@jenkins:client:lib" state.sls jenkins.client.lib pillar='{"jenkins": {"client": {"lib" : {"pipeline-library" : {"branch": "refs/tags/2019.2.17"}}}}}'
    
  4. Follow the Upgrade DriveTrain to a newer release version procedure and the same Deploy - upgrade MCP DriveTrain Jenkins pipeline job:

    • If you are applying maintenance updates starting from version 2019.2.2 to 2019.2.8, consider specifying the following parameters in the DRIVE_TRAIN_PARAMS field. For later versions, the parameters are predefined and set to false by default.
      • OS_DIST_UPGRADE - do not set to true unless you want the system packages including kernel to be upgraded using apt-get dist-upgrade.
      • OS_UPGRADE - do not set to true unless you want all installed applications to be upgraded using apt-get upgrade.
    • If you are applying maintenance updates from any version to 2019.2.8 or later, select APPLY_MODEL_WORKAROUNDS to apply the cluster model workarounds automatically unless you manually added some patches to the model before the update.
    • Set the TARGET_MCP_VERSION parameter specifying the current MCP release version. For example, if your current MCP version is 2019.2.0, specify 2019.2.0.
    • Starting from 2019.2.8, alternatively, set the TARGET_MCP_VERSION, MK_PIPELINES_REFSPEC, and GIT_REFSPEC parameters to the desired maintenance update version. For example, to update from 2019.2.7 to 2019.2.8, specify 2019.2.8.

    Warning

    Starting from the maintenance update 2019.2.15, Jenkins update is run asynchronously. After you update DriveTrain, wait until Jenkins is updated and restarted several times. To verify that Jenkins update is finished, run the salt-run jobs.active command from the Salt Master node. Consider the update finished if the output does not include an active task named jenkins.client.
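
    For example, a minimal check; an empty result means that no jenkins.client task is active:

      salt-run jobs.active | grep jenkins.client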

  5. Optional. Upgrade the system packages on the Salt Master node as described in step 16 of the Upgrade to MCP release version 2019.2.0 procedure.

  6. Update GlusterFS.

Once you update DriveTrain, proceed with applying maintenance updates to your MCP cluster components as required. For details, see: Update an MCP cluster.

Update GlusterFS

If you do not have any services that run on top of the GlusterFS volumes except the Docker Swarm services, such as Jenkins, Gerrit, and LDAP, you can use the Jenkins Update GlusterFS pipeline job to automatically update GlusterFS to the latest version. This pipeline job sequentially executes the following dedicated GlusterFS update pipeline jobs with the default parameters:

  • Update glusterfs servers
  • Update glusterfs clients
  • Update glusterfs cluster.op-version

For the procedure details and description of the pipeline jobs workflow, see Upgrade GlusterFS.

Caution

Before you execute the Update GlusterFS pipeline job, complete steps 1-3 of the Upgrade GlusterFS procedure.

If you have any services that run on top of the GlusterFS volumes except the Docker Swarm services, Mirantis recommends updating the GlusterFS components separately using the dedicated pipeline jobs as described in Upgrade GlusterFS.

Update OpenStack packages

This section provides the reference information to consider when updating the OpenStack packages. Use the descriptive analysis of the techniques and tools, as well as the high-level upgrade flow included in this section, to create a detailed cloud-specific update procedure, assess the risks, and plan the rollback, backup, and testing activities.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

OpenStack update vs OpenStack upgrade

This section explains the differences between the OpenStack to a newer major release upgrade and OpenStack packages update.

The main purpose of an update is to provide minor updates of the OpenStack packages without changing their major versions.

Upgrade vs update
  Upgrade Update
Package versions Between the package versions supported by the consecutive major OpenStack releases. For example, upgrade of Nova v15.0.1 (included in Ocata) to v16.0.1 (included in Pike). Between the package versions within a single major OpenStack release. For example, update of Nova v15.0.1 (included in Ocata) to v15.2.0 (included in Ocata).
Database syncs Performed and required Can be performed but are not required
Control plane downtime Required and expected Not expected as the control plane nodes are updated one by one

Limitations

The following are the limitations of the OpenStack upgrade pipelines that are used for the OpenStack packages update as well:

  • The pipelines update only the OpenStack component. The update of other VCP components such as msg and dbs nodes is out of scope and should be done separately.
  • The update of StackLight LMA is out of scope.

Prerequisites

Before you proceed with the OpenStack packages update, verify the following:

  • The MCP upgrade to the latest Build ID is complete. Verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.
  • All OpenStack formulas states such as nova, neutron, and so on can be launched without errors.
  • (Optional) Online dbsyncs for services are performed before the update maintenance window since this task can take a significant amount of time.
  • No failed OpenStack services or nodes are present in the cloud.
  • Disk space utilization does not exceed 80% on each target node, including the ctl*, prx*, gtw*, and cmp* nodes. See the example check after this list.
  • There is enough disk space on the node that will store the backups for the MySQL databases and the existing cluster model.
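
For example, to check disk space utilization on the target nodes, a sketch using Salt cmd.run (adjust the target expression to your cluster):

  salt -C 'ctl* or prx* or gtw* or cmp*' cmd.run 'df -h /'
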
Update the OpenStack packages

Generally, the OpenStack upgrade and update procedures have common stages that include pre-upgrade/update, the control plane and data plane upgrade/update, and post-upgrade/update. Moreover, during both upgrading and updating, the same tooling is used. See OpenStack upgrade tools overview for details.

OpenStack packages update stages
# Stage Description
1 Planning Includes the creation of the maintenance plan.
2 Pre-update Includes procedures that do not affect workability of the current OpenStack cluster state such as running the QA cycle, verifying infrastructure, configuring monitoring of workloads and services. Also, during this stage, the backups are created and additional services, servers, and systems are installed to facilitate the update process according to the plan.
3 Update The actual update.
3.1 Control plane update The control plane nodes are being updated. The update should not have an impact on the data plane, and compatibility issues between the data plane and control plane of different minor versions are not expected. No downtime of the control plane is expected because the control plane nodes, which run in HA, are updated one by one.
3.2 Data plane update The data plane nodes that host the end-user data applications including the compute, storage, and gateway nodes are being updated.
4 Post update Includes procedures that will not affect workability, such as post-update testing activities, and cleanup.

Warning

Before you perform the update on a production environment, complete the procedure on a staging environment. If a staging environment does not exist, adapt the exact cluster model and launch it inside the cloud as a Heat stack, which will act as a staging environment.

Plan the OpenStack update

As a result of the planning stage of the OpenStack update, a detailed maintenance plan is created.

The maintenance plan must include the following parts:

  • A strict step-by-step update procedure

  • A rollback plan

  • A maintenance window schedule for each update phase

    Note

    The update flow is carefully selected by engineers according to the workload and requirements of a particular cloud.

After the maintenance plan is successfully tested on a staging environment, you can proceed with the actual update in production.

Perform the pre-update activities

The pre-update stage includes the activities that do not affect workability of the currently running OpenStack version, as well as the creation of backups.

To prepare your OpenStack deployment for the update:

  1. (Optional) On one of the controller nodes, perform the online database migrations for the following services:

    Note

    The database migrations can be time-consuming and create high load on CPU and RAM. We recommend that you perform the migrations in batches.

    • Nova:

      nova-manage db online_data_migrations
      
    • Cinder:

      cinder-manage db online_data_migrations
      
    • Ironic:

      ironic-dbsync online_data_migrations
      
  2. Prepare the target nodes for the update:

    1. Log in to the Salt Master node.

    2. Get the list of all updatable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    3. Order the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    4. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters.

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master.

    5. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    6. Match the lists of updatable OpenStack components with the lists of installed applications for each target node.

    7. During the update, the applications running on the target nodes use the KeystoneRC metadata. To guarantee that the KeystoneRC metadata is exported to the Salt mine, verify that you apply the keystone.upgrade.pre state to the keystone:client:enabled node:

      salt -C 'I@keystone:client:enabled' state.sls keystone.upgrade.pre
      
    8. Apply the following states to each target node for each installed application in the strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.pre
      salt <node_name> state.apply <component_name>.upgrade.verify
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.pre
      salt cmp01 state.apply nova.upgrade.verify
      

    Note

    On the clouds of medium and large sizes, you may want to automate this step. Use the following script as an example of possible automation.

    #!/bin/bash
    # List of formulas that implement the upgrade API, sorted by priority
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
          salt $node state.apply $component.upgrade.verify
        fi
      done
    done
    
  3. Add the testing workloads and monitoring to each compute host and verify the following (see the example check after this list):

    • The cloud services are monitored as expected.
    • There are free resources (disk, RAM, CPU) on the kvm, ctl, cmp, and other nodes.
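
    For example, a quick resource check across the nodes, a sketch using Salt cmd.run (adjust the target expression to your cluster):

      salt -C 'kvm* or ctl* or cmp*' cmd.run 'free -m && df -h / && uptime'
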
  4. (Optional) Back up the OpenStack databases as described in Back up and restore a MySQL database.

  5. Adjust the cluster model:

    1. Include the upgrade pipeline job to DriveTrain:

      1. Add the following lines to cluster/cicd/control/leader.yml:

        Caution

        If your MCP OpenStack deployment includes the OpenContrail component, do not specify the system.jenkins.client.job.deploy.update.upgrade_ovs_gateway class.

        classes:
         - system.jenkins.client.job.deploy.update.upgrade
         - system.jenkins.client.job.deploy.update.upgrade_ovs_gateway
         - system.jenkins.client.job.deploy.update.upgrade_compute
        
      2. Apply the jenkins.client state on the Jenkins nodes:

        salt -C 'I@jenkins:client' state.sls jenkins.client
        
    2. Set the parameters in classes/cluster/<cluster_name>/infra/init.yml as follows:

      parameters:
        _param:
          openstack_upgrade_enabled: true
      
    3. (Optional) The upgrade pillars of all supported OpenStack applications are already included on the Reclass system level. In case of a non-standard setup, verify the list of the OpenStack applications on each node and add the upgrade pillars for the OpenStack applications that do not contain them. For example:

      <app>:
        upgrade:
          enabled: ${_param:openstack_upgrade_enabled}
      

      Note

      On the clouds of medium and large sizes, you may want to automate this step. To obtain the list of the OpenStack applications running on a node, use the following script.

      #!/bin/bash
      # List of formulas that implement the upgrade API, sorted by priority
      all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
      jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
      sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
      #List of nodes in cloud
      list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
      for node in $list_nodes; do
        #List of applications on the given node
        node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
        jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
        node_openstack_app=" "
        for component in $all_formulas ; do
          if [[ " ${node_applications[*]} " == *"$component"* ]]; then
            node_openstack_app="$node_openstack_app $component"
          fi
        done
        echo "$node : $node_openstack_app"
      done
      
    4. Refresh pillars:

      salt '*' saltutil.refresh_pillar
      
  6. Prepare the target nodes for the update:

    1. Get the list of all updatable OpenStack components:

      salt-call config.get orchestration:upgrade:applications --out=json
      
    2. Order the components from the output by priority.

    3. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master

    4. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    5. Match the lists of updatable OpenStack components with the lists of installed applications for each target node.

    6. Apply the following states to each target node for each installed application in strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.pre
      

    Note

    On the clouds of medium and large sizes, you may want to automate this step. Use the following script as an example of possible automation.

    #!/bin/bash
    # List of formulas that implement the upgrade API
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.pre
        fi
      done
    done
    
  7. Apply the linux.system.repo state on the target nodes:

    salt <node_name> state.apply linux.system.repo
    
  8. Proceed to Update the OpenStack control plane.

Update the OpenStack control plane

The OpenStack control plane update stage includes updating the APIs of the OpenStack services.

Caution

Perform the update of the VCP nodes one by one to avoid downtime of the control plane.

To update the OpenStack VCP:

  1. Log in to the Jenkins web UI.

  2. Perform the update of the OpenStack controller nodes one by one. For each OpenStack controller node, run the Deploy - upgrade control VMs pipeline in the interactive mode setting the parameters as follows:

    • TARGET_SERVERS='<full_ctl_node_name>' where full_ctl_node_name can be, for example, ctl01.domain.local
    • MODE=INTERACTIVE mode to get the detailed description of the pipeline flow through the stages
    • OS_DIST_UPGRADE - do not select unless you want the system packages including kernel to be upgraded using apt-get dist-upgrade. Deselected by default.
    • OS_UPGRADE - do not select unless you want all installed applications to be upgraded using apt-get upgrade. Deselected by default.
  3. Verify that the VCP is up and the OpenStack services from the data plane are reconnected and working correctly with the newly updated OpenStack controller nodes. For example, to verify the OpenStack services after updating the ctl01 node, apply the following states:

    salt 'ctl01*' state.apply heat.upgrade.verify
    salt 'ctl01*' state.apply nova.upgrade.verify
    salt 'ctl01*' state.apply cinder.upgrade.verify
    salt 'ctl01*' state.apply glance.upgrade.verify
    salt 'ctl01*' state.apply neutron.upgrade.verify
    

    For details on how to get the list of updatable OpenStack components and verify their status on other nodes, see Perform the pre-update activities.

  4. Run the Deploy - upgrade control VMs pipeline on each of the proxy nodes one by one setting TARGET_SERVERS='<full_prx_node_name>', where full_prx_node_name can be, for example, prx01.domain.local.

  5. Verify that the public API is accessible and Horizon is working.

  6. Perform the update of other VCP nodes where required depending on your deployment by running the Deploy - upgrade control VMs pipeline in the interactive mode setting the parameters as follows:

    • TARGET_SERVERS:
      • TARGET_SERVERS=share* to upgrade the Manila control plane
      • TARGET_SERVERS=mdb* to upgrade the Tenant Telemetry including Ceilometer, Gnocchi, Aodh, and Panko
      • TARGET_SERVERS=kmn* to upgrade Barbican
      • TARGET_SERVERS=dns* to upgrade Designate
    • MODE=INTERACTIVE mode to get the detailed description of the pipeline flow through the stages
    • OS_DIST_UPGRADE - do not select unless you want the system packages including kernel to be upgraded using apt-get dist-upgrade. Deselected by default.
    • OS_UPGRADE - do not select unless you want all installed applications to be upgraded using apt-get upgrade. Deselected by default.
  7. Verify that the control plane is up and the OpenStack services from the data plane are reconnected and working correctly with the newly updated control plane.

  8. From an OpenStack controller node, clean up the old heat-engine records:

    1. Verify that all Heat engines have started and the number of active Heat engines matches the number of OpenStack controller nodes multiplied by the number of Heat engine workers per node:

      heat-manage service list
      
    2. Remove all old, stopped engines:

      heat-manage service clean
      
  9. Verify that the OpenStack packages have been updated as expected. For example, to verify the packages after updating the ctl01 node with OS_UPGRADE selected, run the following command:

    salt 'ctl01*' cmd.run 'apt-get -s -V -u upgrade'
    

    To verify the packages after updating the ctl01 node with OS_DIST_UPGRADE selected:

    salt 'ctl01*' cmd.run 'apt-get -s -V -u dist-upgrade'
    
  10. Proceed to Update the OpenStack data plane.

Update the OpenStack data plane

The OpenStack data plane includes the servers that host end-user data applications. More specifically, these hosts include the compute, storage, and gateway nodes. Depending on the update requirements, you can apply any update depth while updating the data plane.

To update the data plane of your OpenStack deployment:

  1. Update the gateway nodes depending on your use case:

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.

    • If non-HA routers are present in the cloud:

      1. Migrate the non-HA routers from the target nodes using the Neutron service, in particular, the following commands (see the example after this list):

        • l3-agent-router-add
        • l3-agent-router-remove
        • router-list-on-l3-agent
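
        For example, to move a single router between L3 agents, a sketch that assumes you obtain the agent and router IDs from neutron agent-list and neutron router-list:

          neutron router-list-on-l3-agent <src_l3_agent_id>
          neutron l3-agent-router-remove <src_l3_agent_id> <router_id>
          neutron l3-agent-router-add <dst_l3_agent_id> <router_id>
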
      2. Log in to the Jenkins web UI.

      3. Run the Deploy - upgrade OVS gateway pipeline for the gateway nodes which you have migrated the workloads from.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

        Note

        Since all resources have already been migrated from the nodes, we recommend performing the full upgrade including OS_UPGRADE and OS_DIST_UPGRADE.

      4. Migrate the non-HA routers back and rerun the Deploy - upgrade OVS gateway pipeline for the rest of the gateway nodes.

    • If non-HA routers are not present in the cloud:

      1. Log in to the Jenkins web UI.

      2. Run the Deploy - upgrade OVS gateway pipeline for all gateway nodes specifying TARGET_SERVERS='gtw*'.

        Note

        Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

  2. Verify that the gateway components are reconnected to the control plane.

    Caution

    Skip this step if your MCP OpenStack deployment includes the OpenContrail component because such configuration does not contain gateway nodes.
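
    For example, you can check the state of the networking agents on a gateway node from an OpenStack controller node (an illustrative check; the path to the OpenStack credentials file may differ in your deployment):

      salt 'ctl01*' cmd.run '. /root/keystonercv3; openstack network agent list --host <gateway_node_hostname>'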

  3. Update the OpenStack compute nodes.

    1. Estimate and minimize the risks and address the limitations of the live migration.

      The limitations of the live migration technology include:

      Warning

      Before proceeding with the live migration in a production environment, assess these risks thoroughly.

      • The CPU feature set of a source compute node must be a subset of the feature set of the target compute node CPU. Therefore, perform the migration between compute nodes with identical CPUs and, preferably, identical microcode versions.
      • During the live migration, the entire memory state of a VM must be copied to another server. The memory pages that change at a slower rate are copied first, and the most actively written pages are copied last. If the number of pages being written to is large, the migration process may never converge. High-memory, high-load Windows virtual machines are known to have this issue.
      • During the live migration, a very short downtime (1-2 seconds max) occurs. The reason for the downtime is that when the memory is copied, the execution context (vCPU state) has to be copied as well, and the execution itself must be switched to a new virtual machine. In addition to a short downtime, this causes a short clock lag on the migrated virtual machine. Therefore, if the migrated machine is hosting a part of a clustered service or system, the downtime and resulting time lag may have an adverse impact on the whole system.
      • The QEMU versions installed on the source and target hosts must be the same and later than 2.5.
    2. Perform the live migration of workloads.
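
      For example, a live migration of a single instance to a specific target host may look as follows (illustrative only; the instance ID and host name are placeholders, and the OpenStack credentials must be sourced first):

      nova live-migration <instance_id> <target_compute_hostname>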

    3. Log in to the Jenkins web UI.

    4. Run the Deploy - upgrade computes pipeline to update the OpenStack compute nodes from which you have migrated the workloads. It is essential that you update the compute nodes in small batches.

      Caution

      The impact of the update process should be calculated for each compute node during the planning stage as this step may take a significant amount of time.

      Note

      Run the pipeline in the interactive mode to get the detailed description of the pipeline flow through the stages.

    5. Migrate the workloads back and rerun the Deploy - upgrade computes pipeline for the rest of the compute nodes.

  4. Verify that the compute nodes are reconnected to the control plane.
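
    For example, to verify that the nova-compute services report the up state (an illustrative check; the path to the OpenStack credentials file may differ in your deployment):

    salt 'ctl01*' cmd.run '. /root/keystonercv3; openstack compute service list --service nova-compute'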

  5. Proceed to Perform the post-update activities.

Perform the post-update activities

The post-update activities include the post-update testing cycle and cleanup.

To finalize the update:

  1. Perform the full verification cycle of your MCP OpenStack deployment.

  2. Verify that the following variables are set in the classes/cluster/<cluster_name>/infra/init.yml file:

    parameters:
      _param:
        openstack_upgrade_enabled: false
    
  3. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  4. Remove the test workloads/monitoring.

  5. Remove the update leftovers that were created by applying the <app>.upgrade.post state:

    1. Log in to the Salt Master node.

    2. Get the list of all updatable OpenStack components. For example:

      salt-call config.get orchestration:upgrade:applications --out=json
      

      Example of system response:

      {
          "<model_of_salt_master_name>": {
              "nova": {
                  "priority": 1100
              },
              "heat": {
                  "priority": 1250
              },
              "keystone": {
                  "priority": 1000
              },
              "horizon": {
                  "priority": 1800
              },
              "cinder": {
                  "priority": 1200
              },
              "glance": {
                  "priority": 1050
              },
              "neutron": {
                  "priority": 1150
              },
              "designate": {
                  "priority": 1300
              }
          }
      }
      
    3. Sort the components from the output by priority. For example:

      keystone
      glance
      nova
      neutron
      cinder
      heat
      designate
      horizon
      
    4. Get the list of all target nodes:

      salt-key | grep $cluster_domain | \
      grep -v $salt_master_hostname | tr '\n' ' '
      

      The cluster_domain variable stands for the name of the domain used as part of the cluster FQDN. For details, see MCP Deployment guide: General deployment parameters: Basic deployment parameters.

      The salt_master_hostname variable stands for the hostname of the Salt Master node and is cfg01 by default. For details, see MCP Deployment guide: Infrastructure related parameters: Salt Master.

    5. For each target node, get the list of installed applications:

      salt <node_name> pillar.items __reclass__:applications --out=json
      
    6. Match the lists of the updatable OpenStack components with the lists of installed applications for each target node.

    7. Apply the following states to each target node for each installed application in the strict order of priority:

      salt <node_name> state.apply <component_name>.upgrade.post
      

      For example, for Nova installed on the cmp01 compute node, run:

      salt cmp01 state.apply nova.upgrade.post
      

    Note

    On medium and large clouds, you may want to automate this step. Use the following script as an example of possible automation.

    #!/bin/bash
    #List of formulas that implement the upgrade API, sorted by priority
    all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | \
    jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | \
    sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
    #List of nodes in cloud
    list_nodes=`salt -C 'I@__reclass__:applications' test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
    for node in $list_nodes; do
      #List of applications on the given node
      node_applications=$(salt $node pillar.items __reclass__:applications --out=json | \
      jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
      for component in $all_formulas ; do
        if [[ " ${node_applications[*]} " == *"$component"* ]]; then
          salt $node state.apply $component.upgrade.post
        fi
      done
    done
    

Update Galera

To update the Galera cluster, use the Upgrade Galera procedure. Only the MySQL and Galera packages will be updated. The update of an underlying operating system is out of scope.

Update RabbitMQ

To update the RabbitMQ component, use the Upgrade RabbitMQ procedure and the same Deploy - upgrade RabbitMQ server pipeline job.

Caution

We recommend that you do not upgrade or update OpenStack and RabbitMQ simultaneously. Upgrade or update the RabbitMQ component only once OpenStack is running on the new version.

Update the OpenContrail 4.x nodes

Caution

Before proceeding with the upgrade procedure, verify that you have updated DriveTrain including Aptly, Gerrit, Jenkins, Reclass, Salt formulas, and their subcomponents to the current MCP release version. Otherwise, the current MCP product documentation is not applicable to your MCP deployment.

This section describes how to apply minor updates to the OpenContrail 4.x nodes, for example, from version 4.0 to 4.1.

To update the OpenContrail 3.2 packages, refer to the Upgrade the OpenContrail nodes to version 3.2 procedure.

Caution

OpenContrail 4.x for Kubernetes 1.12 or later is not supported.

Prerequisites

Before you start updating the OpenContrail nodes of your MCP cluster, complete the following prerequisite steps:

  1. Configure the server and client backup roles for Cassandra and ZooKeeper as described in OpenContrail 4.x: Create a backup schedule for a Cassandra database and OpenContrail 4.x: Create a backup schedule for a ZooKeeper database.

  2. Verify that all OpenContrail services are up and running on all OpenContrail nodes. See Verify the OpenContrail status.

    Note

    If you update MCP from version 2019.2.3, the contrail-webui service may be in the inactive state due to the missing quotation mark in /etc/contrail/config.global.js. To fix the issue, see: MCP 2019.2.3 maintenance updates.

  3. Verify that you have enough free disk space on the OpenContrail ntw* and nal* nodes since the update pipeline job downloads new versions of the Docker images. Disk utilization must not exceed 80% on each target node.
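
    For example, to check the current disk utilization on the target nodes (an illustrative check):

    salt -C "ntw* or nal*" cmd.run "df -h"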

Once done, proceed to Prepare the cluster model.

Prepare the cluster model

After you complete the prerequisite steps, prepare your cluster model for the update by configuring your Git project repository as described below.

To prepare the cluster model:

  1. Log in to the Salt Master node.

  2. In the cluster/<name>/opencontrail/init.yml file, update the OpenContrail version. For example:

    _param:
      linux_repo_contrail_component: oc41
      opencontrail_version: 4.1
    
  3. In cluster/<name>/openstack/dashboard.yml, update the OpenContrail version. For example:

    _param:
      opencontrail_version: 4.1
    
  4. In cluster/<name>/openstack/init.yml, verify that the following parameters contain correct values:

    _param:
      opencontrail_admin_password: <contrail_user_password>
      opencontrail_admin_user: 'contrail'
    
  5. If you update OpenContrail to version 4.1, in cluster/<name>/opencontrail/analytics.yml, add or change the following parameters:

    _param:
      opencontrail_kafka_config_dir: '/etc/kafka'
      opencontrail_kafka_log_dir: '/var/log/kafka'
    
  6. Add the OpenContrail update pipeline job:

    1. Add the following class to cluster/cicd/control/leader.yml:

      classes:
      - system.jenkins.client.job.deploy.update.update_opencontrail4
      
    2. Apply the jenkins.client state:

      salt -C 'I@jenkins:client' state.sls jenkins.client
      

Once done, proceed to Update the OpenContrail nodes.

Update the OpenContrail nodes

After you prepare the cluster model of your MCP cluster, proceed to updating the OpenContrail controller nodes and the OpenContrail vRouter packages on the compute nodes.

Warning

During the update process, the following resources are affected:

  • The instance(s) running on the compute nodes can be unavailable for up to 30 seconds.
  • The creation of new instances is not possible during the same time interval.

Therefore, you must plan a maintenance window as well as test the update on a staging environment before applying it to production.

To update the OpenContrail nodes:

  1. Log in to the Jenkins web UI.

  2. Verify that you do not have any unapproved scripts in Jenkins:

    1. Navigate to Manage Jenkins > In-process script approval.
    2. Approve pending scripts if any.
  3. Open the Deploy - update Opencontrail 4X pipeline.

  4. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    STAGE_CONTROLLERS_UPDATE Select to update the OpenContrail controller nodes.
  5. Click Deploy.

To update the OpenContrail vRouter packages on the compute nodes:

  1. Log in to the Jenkins web UI.

  2. Open the Deploy - update Opencontrail 4X pipeline.

  3. Specify the following parameters:

    Parameter Description and values
    COMPUTE_TARGET_SERVERS Add I@opencontrail:compute or target the global name of your compute nodes, for example cmp001*.
    COMPUTE_TARGET_SUBSET_LIVE Add 1 to run the update first on only one of the nodes defined in the COMPUTE_TARGET_SERVERS field. After this stage is done, in the pipeline job console, you will be asked to confirm the update of the remaining nodes defined in the COMPUTE_TARGET_SERVERS field.
    SALT_MASTER_CREDENTIALS The Salt Master credentials to use for connection, defaults to salt.
    SALT_MASTER_URL The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.
    STAGE_COMPUTES_UPDATE Select to update the compute nodes.
  4. Click Deploy. For details on how to monitor the deployment process, see MCP Deployment Guide: View the deployment details.

  5. Log in to the Salt Master node.

  6. Restart the vrouter-agent and vrouter-nodemgr services on the OpenContrail compute nodes:

    salt -C 'I@opencontrail:compute' service.restart contrail-vrouter-agent
    salt -C 'I@opencontrail:compute' service.restart contrail-vrouter-nodemgr
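
    Optionally, verify that the vRouter services are active after the restart, for example (an illustrative check):

    salt -C 'I@opencontrail:compute' cmd.run 'contrail-status'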
    

The Deploy - update Opencontrail 4X pipeline workflow:

  1. If STAGE_COMPUTES_UPDATE is selected, update the OpenContrail packages on the compute nodes in two iterations:
    1. Update the sample nodes defined by COMPUTE_TARGET_SUBSET_LIVE.
    2. After a manual confirmation, update all compute nodes targeted in COMPUTE_TARGET_SERVERS.
  2. If STAGE_CONTROLLERS_UPDATE is selected, download a new version of controller, analytics, and analytics db containers.
  3. Stop the running analytics and analytics db containers on nal nodes and start the updated containers.
  4. Stop the running controller containers on ntw nodes and start the updated containers.

Update Virtlet

This section describes how to install maintenance updates, for example, hot fixes, to Virtlet by updating its version to the latest supported one.

During the update, the Virtlet VM may not be accessible using the kubectl attach command, but it retains network connectivity. The MCP cluster workload is not affected.

To update Virtlet:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Change the Virtlet version in /kubernetes/compute.yml:

    parameters:
      kubernetes:
        common:
          addons:
            virtlet:
              enabled: true
              namespace: kube-system
              image: mirantis/virtlet:<virtlet_version>
    
  3. Log in to the Salt Master node.

  4. Apply the following states:

    salt -C 'I@kubernetes:master' state.sls kubernetes.master.kube-addons
    salt -C 'I@kubernetes:master' state.sls kubernetes.master.setup
    
  5. Verify that Virtlet is updated successfully:

    1. Log in to any Kubernetes Master node.

    2. Run the following command:

      kubectl rollout status -n kube-system ds/virtlet
      

      Note

      Substitute virtlet with a custom name if the default virtlet name was changed.
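
      Optionally, you can also confirm that the DaemonSet references the expected image (an illustrative spot check; adjust the namespace and DaemonSet name if they were customized):

      kubectl get ds virtlet -n kube-system -o jsonpath='{.spec.template.spec.containers[*].image}'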

Obtain Ubuntu security updates

You can obtain security updates for the host operating system packages of your OpenStack-based MCP cluster starting from the Build ID 2018.8.0 using the update repositories. The update repositories provide updates of specific Ubuntu packages for the CI/CD, OpenStack, and StackLight LMA nodes.

To obtain Ubuntu security updates, proceed with one of the following:

  • For existing deployments:
    1. Add the update repositories.
    2. Apply the security updates as described below either manually or using the dedicated Jenkins pipeline.
  • For new deployments, starting from the Q3`18 Release Version, the update repositories are already included. Therefore, proceed with either Apply security updates manually or Apply Ubuntu security updates using Jenkins right away.

Caution

Due to a limitation, do not upgrade the cfg01 node.

Caution

Mirantis recommends that you plan a maintenance window because the nodes will be rebooted. The maintenance window time depends on the number of nodes, vRouters, and the complexity of workload migration, and is approximately two hours excluding the migration time. During this period, the MCP control plane may not be accessible. Network interrupts may occur during the namespaces rebuild because the gateway nodes will be rebooted. Workloads may be interrupted if you reboot the compute nodes.

Add an update repository

This section describes how to add an update repository for the existing MCP deployments starting from the Build ID 2018.8.0 to allow for obtaining the Ubuntu security updates.

To add an update repository:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In the classes/cluster/<cluster_name>/infra/init.yml file, specify the following parameters:

    parameters:
      _param:
        linux_system_repo_update_url: http://mirror.mirantis.com/update/${_param:mcp_version}/
        linux_system_repo_update_ubuntu_url: ${_param:linux_system_repo_update_url}/ubuntu/
      linux:
        system:
          repo:
            ubuntu_security_update:
              refresh_db: ${_param:linux_repo_refresh_db}
              source: "deb [arch=amd64] ${_param:linux_system_repo_update_ubuntu_url} ${_param:linux_system_codename}-security main restricted universe"
              architectures: amd64
              default: true
            mcp_saltstack:
              pin:
              - enabled: true
                pin: 'release o=SaltStack'
                priority: 50
                package: 'libsodium18'
    
  3. Log in to the Salt Master node.

  4. Apply the following state:

    salt "*" state.sls linux.system.repo
    

Once done, proceed to Apply security updates manually.

Apply security updates manually

Once you have created an update repository as described in Add an update repository, or if you already have an update repository, you can apply the security updates.

This section instructs you on how to update Ubuntu packages manually. More specifically, the procedure includes updating the KVM nodes one by one, the VMs running on top of each KVM node, and the OpenStack compute nodes.

If you prefer the automated update, use the Apply Ubuntu security updates using Jenkins procedure.

Note

This documentation does not cover the upgrade of Ceph nodes.

To apply the Ubuntu security updates manually:

  1. Log in to the Salt Master node.

  2. For OpenStack Ocata with OpenContrail v3.2, update the python-concurrent.futures package. Otherwise, skip this step.

    salt -C "ntw* or nal*" cmd.run "apt-get -y --force-yes upgrade python-concurrent.futures"
    
  3. Update the Virtual Control Plane and KVM nodes:

    1. Identify the KVM nodes:

      salt "kvm*" test.ping
      

      Example of system response:

      kvm01.bud.mirantis.net:
          True
      kvm03.bud.mirantis.net:
          True
      kvm02.bud.mirantis.net:
          True
      

      In the example above, three KVM nodes are identified: kvm01, kvm02, and kvm03.

    2. Identify the VMs running on top of each KVM node. For example, for kvm01:

      salt "kvm01*" cmd.run "virsh list --name"
      

      Example of system response:

      kvm01.bud.mirantis.net:
          cfg01
          prx01.<domain_name>
          ntw01.<domain_name>
          msg01.<domain_name>
          dbs01.<domain_name>
          ctl01.<domain_name>
          cid01.<domain_name>
          nal01.<domain_name>
      
    3. Using the output of the previous command, upgrade the cid, ctl, gtw/ntw/nal, log, mon, msg, mtr, dbs, and prx VMs of the particular KVM node. Do not upgrade cfg01, cmp, and kvm.

      Note

      Ceph nodes are out of the scope of this procedure.

      Example:

      for NODE in prx01.<domain_name> \
                  ntw01.<domain_name> \
                  msg01.<domain_name> \
                  dbs01.<domain_name> \
                  ctl01.<domain_name> \
                  cid01.<domain_name> \
                  nal01.<domain_name>
      do
        salt "${NODE}*" cmd.run "export DEBIAN_FRONTEND=noninteractive && \
                        apt-get update && \
                        apt-get -y upgrade && \
                        apt-get -y -o Dpkg::Options::="--force-confdef" \
                                -o Dpkg::Options::="--force-confnew" dist-upgrade"
      done
      
    4. Wait for all services of the cluster to be up and running.

      1. If the KVM node hosts GlusterFS, verify the GlusterFS server and volumes statuses as described in Troubleshoot GlusterFS. Proceed with further steps only if the GlusterFS status is healthy.
      2. If the KVM node hosts a dbs node, verify that the Galera cluster status is Synced and contains at least three nodes as described in Verify a Galera cluster status.
      3. If the KVM node hosts a msg node, verify the RabbitMQ cluster status and that it contains at least three nodes as described in Troubleshoot RabbitMQ.
      4. If the KVM node hosts OpenContrail 4.x, verify that its services are up and running as described in Verify the OpenContrail status. If any service fails, troubleshoot it as required. For details, see: Troubleshoot OpenContrail.
    5. Once you upgrade the VMs running on top of a KVM node, upgrade and restart this KVM node itself. For example, kvm01:

      salt "kvm01*" cmd.run "export DEBIAN_FRONTEND=noninteractive && \
                        apt-get update && \
                        apt-get -y upgrade && \
                        apt-get -y -o Dpkg::Options::="--force-confdef" \
                                   -o Dpkg::Options::="--force-confnew" dist-upgrade && \
                        shutdown -r 0"
      
    6. Wait for all services of the cluster to be up and running. For details, see the substeps of step 3.4.

    7. Repeat the steps 3.2-3.6 for the remaining kvm nodes one by one.

  4. Upgrade the OpenStack compute nodes using the example below, where cmp10. is a set of nodes.

    Caution

    Before upgrading a particular set of cmp nodes, migrate the critical cloud environment workloads, which should not be powered off, from these nodes to the cmp nodes that are not under maintenance.

    Example:

    # Check that the command will be applied to only needed nodes
    salt -E "cmp10." test.ping
    # Perform the upgrade
    salt -E "cmp10." cmd.run "export DEBIAN_FRONTEND=noninteractive && apt-get \
    update && apt-get -y upgrade && apt-get -y -o \
    Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confnew" \
    dist-upgrade && apt-get -y install linux-headers-generic && shutdown -r 0"
    
Apply Ubuntu security updates using Jenkins

Once you have created an update repository as described in Add an update repository, or if you already have an update repository, you can apply the security updates.

This section instructs you on how to update Ubuntu packages automatically. If you prefer the manual update, use the Apply security updates manually procedure.

To apply the Ubuntu security updates automatically:

  1. Log in to the Jenkins web UI.

  2. Run the Deploy - update system package(s) pipeline specifying the following parameters as required:

    Deploy - update system package(s) pipeline parameters
    Parameter Description
    SALT_MASTER_URL Full Salt API address, for example, https://10.10.10.1:8000
    SALT_MASTER_CREDENTIALS Credentials to the Salt API
    TARGET_SERVERS Salt compound target to match nodes to be updated. For example, [*, G@osfamily:debian].
    TARGET_PACKAGES Space-delimited list of packages to be updated. For example, [package1=version package2=version]. An empty string means updating all packages to the latest version.
    BATCH_SIZE Added since 2019.2.6 update The batch size for Salt commands targeted for a large amount of nodes. Set to an absolute number of nodes (integer) or percentage, for example, 20 or 20%. For details, see MCP Deployment Guide: Configure Salt Master threads and batching.

Update Ceph

Starting from the Build ID 2019.2.0, you can update Ceph packages to the latest minor versions on the Ceph OSD, Monitor, and RADOS Gateway nodes using the Update Ceph packages pipeline job. During the update, all Ceph services restart one by one and the pipeline does not proceed until the cluster is healthy.

Warning

If you are updating Ceph from a version prior to 14.2.20, verify that the ceph:common:config:mon:auth_allow_insecure_global_id_reclaim pillar is unset or set to true.
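
For example, to check the current value of this pillar on the Ceph nodes (an illustrative lookup):

  salt -C 'I@ceph:common' pillar.get ceph:common:config:mon:auth_allow_insecure_global_id_reclaim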

Note

This procedure does not include the backup of Ceph VMs. To back up the Ceph VMs, perform the steps described in Back up and restore Ceph.

To update Ceph packages to the latest minor versions:

  1. Verify that you have upgraded your MCP cluster to the latest Build ID as described in Upgrade DriveTrain to a newer release version.

  2. Log in to the Jenkins web UI.

  3. Open the Update Ceph packages pipeline job.

  4. Specify the following parameters:

    Parameter Description and values
    SALT_MASTER_CREDENTIALS Credentials for Salt API stored in Jenkins.
    SALT_MASTER_URL The URL of Salt API. For example, https://10.10.10.1:8000.
    TARGET_SERVERS Removed since 2019.2.5 update Salt compound target to match the nodes to be updated. For example, G@osfamily:debian or * to update the packages on all Ceph nodes.
    CLUSTER_FLAGS Added since 2019.2.7 update Add a comma-separated list of flags to apply before and after the pipeline execution.
  5. Click Build.

  6. Click Full stage view to track the update process.

Caution

The Jenkins pipeline job changes the repositories, upgrades the packages, and restarts the services on each node of the selected groups. Because the pipeline installs packages from the configured repositories and does not verify the version to install, the version of the Ceph packages can differ between nodes after the upgrade in some environments. This can affect a single node, for example, due to a manual configuration made outside the model, or an entire group, such as the Ceph OSD nodes, due to a misconfiguration in Reclass. It is possible to run a cluster with mismatching versions. However, such a configuration is not supported and, with specific versions, may cause a cluster outage.

The Jenkins pipeline job reports the package version changes in the console output for each node. Consider reviewing them before proceeding to the next node, especially on the first node of each component.

The following table contains the details of the update stages:

The Update Ceph packages pipeline job workflow
# Stage Details
1 List the target servers
  1. Selects the Ceph nodes from the available targets.
  2. Obtains the minions from TARGET_SERVERS to list all available targets and then selects the ones that include rgw, cmn, or osd in the name.
2 Apply package upgrades on all nodes Updates and installs new Ceph packages on the selected nodes.
3 Restart the services
  1. Restarts Ceph Monitor on all cmn nodes one by one.
  2. Starting from the 2019.2.8 update, restarts Ceph Manager on all mgr nodes one by one.
  3. Restarts Ceph OSDs on all osd nodes one by one.

After a service restart on each node, the Jenkins pipeline job waits for the cluster to become healthy.

Update StackLight LMA

You can apply maintenance updates to your StackLight LMA deployment within the scope of a major MCP release version.

To update StackLight LMA within the scope of a major MCP release version, use the Upgrade StackLight LMA to Build ID 2019.2.0 procedure and the same Deploy - upgrade Stacklight pipeline job. Before updating StackLight LMA, you may need to perform manual steps or select specific pipeline job parameters. For details on a particular maintenance update, see: MCP Release Notes: Maintenance updates.

Cloud verification

MCP enables the deployment, QA, support engineers, and cloud operators to perform functional as well as non-functional types of testing of cloud environments through DriveTrain using the Cloud Verification Pipeline (CVP) tooling.

CVP is a set of tools, procedures, and components allowing for automatic verification of MCP deployments. CVP can be applied to newly built deployments to verify the functionality, performance, and fault tolerance of cloud environments, as well as to existing deployments to verify their functionality and performance before and after any changes.

The main characteristics of CVP include:

  • Non-destructive testing
  • Resources cleanup after test run
  • Fully automated test runs
  • Granular configurable testing
  • Extensible tests enabling the operator to add custom tests in the PyTest format and extend the Tempest test framework for OpenStack functional tests

CVP requires an Internet connection to the following URLs:

If HTTP or HTTPS proxies are in use, they must be present in the Docker daemon configuration.

CVP pipelines work in the offline mode but some adjustments may be needed for MCP pipelines.

CVP pipelines
Jenkins pipeline Description Source code (GitHub)
CVP - Sanity checks Sanity testing to verify the cloud platform infrastructure components Mirantis/cvp-sanity-checks
CVP - Functional tests OpenStack functional integration testing (Tempest)
CVP - Performance tests OpenStack Rally-based baseline performance testing Mirantis/cvp-configuration
CVP - HA tests OpenStack high availability testing
CVP - StackLight tests StackLight LMA basic verification Mirantis/stacklight-pytest
CVP - Shaker network tests Data plane networking test Mirantis/cvp-shaker
CVP pipelines workflow
# Details
1 Enable CVP in the Reclass model if required.
2 Adjust a test set to the deployment-specific requirements if needed.
3 Configure a related Jenkins pipeline.
4 Build the pipeline.
5 Review test results.

This section explains how to perform different types of cloud verification using corresponding Jenkins pipelines.

Enable CVP pipelines in the deployment model

Cloud verification pipelines are included in the MCP DriveTrain by default. If your deployment does not contain CVP pipelines, you can enable the pipelines in your Reclass model as described in this section. You can also schedule a recurring build job for CVP pipelines as required.

To enable CVP pipelines:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<CLUSTER_NAME>/cicd/control/leader.yml file, add the following class:

    - system.jenkins.client.job.validate
    
  3. Apply the jenkins.client.job state:

    salt 'cid01*' state.apply jenkins.client.job
    

To schedule a recurring build for a job:

  1. Log in to the Salt Master node.

  2. Modify the classes/cluster/<CLUSTER_NAME>/cicd/control/leader.yml file as required. For example:

    jenkins:
      client:
        job:
          cvp-sanity:
            trigger:
              timer:
                spec: 'H 8 * * *'
    

    Note

    The spec field follows the cron syntax. For more details, see: TimerTrigger.

  3. Apply the jenkins.client.job state for the cid01* node:

    salt 'cid01*' state.apply jenkins.client.job
    

Applying the change in the example above schedules the cvp-sanity job execution to 8 am UTC every day.

Perform sanity testing

Sanity tests provide basic verification of your MCP deployment, helping you troubleshoot the infrastructure and verify whether the environment is ready for further testing. You can perform the sanity testing of your environment using the CVP - Sanity checks Jenkins pipeline.

Execute the CVP - Sanity checks pipeline

This section instructs you on how to perform the sanity testing of your deployment using the CVP - Sanity checks Jenkins pipeline.

To perform the sanity testing of your deployment:

  1. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.
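
    For example, you can read this parameter from the Salt Master node (an illustrative lookup):

      salt 'cid01*' pillar.get _param:cicd_control_address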

  2. Log in to the Jenkins web UI as admin.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  3. In the global view, find the CVP - Sanity checks pipeline.

  4. Select the Build with Parameters option from the drop-down menu of the pipeline.

  5. Configure the following parameters as required:

    CVP - Sanity checks parameters
    Parameter Description
    DEBUG_MODE Removed since 2019.2.4 update If checked, keeps the container (if the IMAGE parameter is defined) after the test is performed for debugging purposes. This option is deprecated and ignored starting from the Q4`18 MCP release.
    IMAGE Specifies the cvp-sanity Docker image (with all dependencies) that will be used during the test run. For offline mode, use the URL from the local artifactory or offline image.
    PROXY Removed since 2019.2.4 update If an environment uses HTTP or HTTPS proxy, verify that you specify it in this field as this proxy address will be used to clone the required repositories and install the Python requirements. This option is deprecated and ignored starting from the Q4`18 MCP release.
    SALT_MASTER_CREDENTIALS Specifies the credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines.
    SALT_MASTER_URL

    Specifies the reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969.

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      
    TESTS_REPO Removed since 2019.2.4 update Specifies the repository with the sanity tests, which can be either a GitHub URL or an internal Gerrit repository with custom tests. By default, the value for this parameter is empty. Since this option is deprecated and ignored starting from the Q4`18 release, Mirantis recommends leaving this field empty.
    TESTS_SET Removed since 2019.2.4 update

    Specifies the name of the test to perform or directory to discover tests. By default, it is cvp-sanity/cvp_checks/tests. Leave the field as is for a full test run or specify the test filename. For example, the <default_path>/test_repo_list.py will only run the test for repositories.

    If a custom image is used and the TESTS_REPO field is empty, provide the full path to the folder with tests.

    TESTS_SETTINGS Removed since 2019.2.4 update Specifies additional environment variables that can be passed to the framework. You can override configuration values with these variables. For example, export skipped_nodes=mtr01.local,log02.local will force the job to skip the specified nodes in all tests. See global_config.yaml for details.
    EXTRA_PARAMS Added since 2019.2.4 update

    Specifies additional environment variables in the YAML format using the envs key. For example:

    envs:
      - tests_set=tests/test_drivetrain.py
      - skipped_nodes=mtr01.local,log02.local
    

    You can override configuration values with these variables. For example, skipped_nodes=mtr01.local,log02.local will force the job to skip the specified nodes in all tests. See global_config.yaml for details. Do not use quotes and spaces in the values of parameters.

    Added since 2019.2.5 Use the force_pull parameter to enable or disable the pull operation for an image before running the container. Set force_pull=false if image pulling is impossible or not required. In this case, the image may be manually uploaded to the target node.

    Added since 2019.2.7 You can override the configuration values using the override_config variable. For example:

    override_config:
      skipped_nodes:
        - log02
        - apt03
    

    Note

    To perform only the DriveTrain sanity testing, see Perform DriveTrain sanity testing.

  6. Click Build.

  7. Verify the job status:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors found.

    • YELLOW, UNSTABLE

      Some errors occurred during the test run. Proceed to Review the CVP sanity test results.

    • RED, FAILURE

      Testing has failed due to issues with the framework and/or the pipeline configuration. Review the console output.

Perform DriveTrain sanity testing

This section instructs you on how to perform the DriveTrain sanity testing of your deployment using the CVP - Sanity checks Jenkins pipeline. The sanity checks for DriveTrain include testing of the Gerrit, Jenkins, and OpenLDAP functionality, as well as the DriveTrain services replicas, components and versions, and the Jenkins job branch.

To perform the DriveTrain sanity testing:

  1. Perform the steps 1-4 as described in Execute the CVP - Sanity checks pipeline.

  2. Using the step 5 of the same procedure, specify the IMAGE, SALT_MASTER_CREDENTIALS, SALT_MASTER_URL parameters.

  3. Configure the following parameters as required:

    Parameter Description
    TESTS_SET Removed since 2019.2.4 update Specify cvp-sanity/cvp_checks/tests/test_drivetrain.py to perform only the DriveTrain sanity testing.
    TESTS_SETTINGS Removed since 2019.2.4 update drivetrain_version=<your_mcp_version>
    EXTRA_PARAMS Added since 2019.2.4

    Specify tests/test_drivetrain.py to perform only the DriveTrain sanity testing. For example:

    envs:
      - tests_set=tests/test_drivetrain.py
    

    Added since 2019.2.7 You can override the configuration values using the override_config variable. For example:

    override_config:
      skipped_nodes:
        - log02
        - apt03
    
  4. Perform the remaining steps of the Execute the CVP - Sanity checks pipeline procedure.

The CVP - Sanity checks pipeline workflow for DriveTrain:

  1. Run the test_drivetrain_openldap job:

    1. Obtain the LDAP server and port from the _param:haproxy_openldap_bind_port and _param:haproxy_openldap_bind_host pillars.
    2. Obtain the LDAP admin password from the _param:openldap_admin_password pillar.
    3. Using the admin user name from the openldap:client:server:auth:user pillar and password from the previous step, connect to the LDAP server.
    4. Create a test user DT_test_user.
    5. Add the created test user to the admins group.
    6. Using the test user credentials, try obtaining the information about the cvp-sanity job.
    7. Using the test user credentials, try connecting to the Gerrit server and check the list of patches created by this user.
    8. Delete the user from the admins group and from the LDAP server.
  2. Run the test_drivetrain_jenkins_job job:

    1. Obtain the Jenkins server and port from the _param:haproxy_jenkins_bind_port and _param:haproxy_jenkins_bind_host pillars.

    2. Obtain the Jenkins admin password from the _param:openldap_admin_password pillar.

    3. Connect to Jenkins using the credentials from the previous steps.

    4. Execute <jenkins_test_job> on the cid node.

      Note

      The pipeline checks whether a DT-test-job job exists and, if it does not, creates a new empty DT-test-job job. You can specify another job using the jenkins_test_job option in the TESTS_SETTINGS parameter.

    5. Wait 180 seconds for the SUCCESS status of the job.

  3. Run the test_drivetrain_gerrit job:

    1. Obtain the Gerrit server and port from the _param:haproxy_gerrit_bind_port and _param:haproxy_gerrit_bind_host pillars.
    2. Obtain the Gerrit admin password from the _param:openldap_admin_password pillar.
    3. Connect to Gerrit using the credentials from the previous steps.
    4. Add a new test-dt-<current_date> project.
    5. Create a new file and send it on review.
    6. As an admin user, add Code-Review +2 to the patch and merge it.
    7. Delete the test project.
  4. Run the test_drivetrain_services_replicas job:

    1. Obtain the list of Docker services from the cid node.
    2. Compare the expected and actual numbers of replicas.
  5. Run the test_drivetrain_components_and_versions job:

    1. Obtain the list of Docker services from the cid node.

    2. Compare the expected and actual lists of the services.

    3. Compare the version of the service with the drivetrain_version version.

      Note

      Prior to the 2019.2.4 update, the test is skipped if drivetrain_version was not defined in the TESTS_SETTINGS parameter.

  6. Run the test_jenkins_jobs_branch job:

    1. Obtain the list of all jobs from Jenkins.

    2. Compare the versions of the jobs with the drivetrain_version version.

      Note

      Prior to the 2019.2.4 update, the test is skipped if drivetrain_version was not defined in the TESTS_SETTINGS parameter.

      In MCP Build ID 2019.2.0, this test will fail due to an issue with all Jenkins jobs having the release/2019.2.0 version instead of 2019.2.0. This issue has been fixed since the 2019.2.2 maintenance update.

      The master version for the deploy-update-salt job is expected behavior.

Review the CVP sanity test results

This section explains how to review the CVP Sanity checks trace logs from the Jenkins web UI.

To review the CVP Sanity checks results:

  1. Log in to the Jenkins web UI.

  2. Navigate to the build that you want to review.

  3. Find Test Results at the bottom of the Build page.

  4. Scroll down and click Show all failed tests to view a complete list of the failed tests.

  5. Click on the test of concern for details.

    Note

    A test name corresponds to the test file path in the test source repository. For example:

    • For MCP versions starting from the 2019.2.4 maintenance update, tests.test_mtu.test_mtu[mtr] corresponds to cvp-sanity/tests/test_mtu.py.
    • For MCP versions before the 2019.2.4 maintenance update, cvp_checks.tests.test_mtu.test_mtu[mtr] corresponds to cvp-sanity-checks/cvp_checks/tests/test_mtu.py.
  6. Review the lines beginning with the E letter in the trace.

    The following example illustrates the error for the mtr01 node, the mtu value for ens2 interface for which differs from other MTUs in the mtr group (1500 versus 1450).

    [Image: cvp-sanity-checks-example.png]

    Note

    If you do not see the full log trace, use the Console output -> View as plain text Jenkins menu instead.

Perform OpenStack functional integration testing

The CVP - Functional tests pipeline performs functional integration testing of the OpenStack environments. The pipeline includes the OpenStack Tempest testing.

Note

The pipeline is provided as is. It contains default settings, examples, and templates that may need adjustment for a deployed environment. Mirantis does not support the third-party components like Tempest or Rally used in the CVP tooling.

Execute the CVP - Functional tests pipeline

This section instructs you on how to perform the OpenStack functional integration testing using the CVP - Functional tests Jenkins pipeline.

Note

Clone the cvp-configuration and tempest repositories to your local Gerrit and use them locally to add new or adjust the existing tests and fine-tune the Tempest configuration.

To perform functional integration testing of your deployment:

  1. In your local cvp-configuration repository, inspect and modify the following items as required:

    • The Tempest configuration (tempest/tempest_ext.conf)
    • The skip list (tempest/skip-list* files)
    • The configure.sh setup script

    Tempest 18.0.0 and Rally 0.11.2 are the default versions for the CVP - Functional tests Jenkins pipeline job in both the offline and online modes.

  2. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.

  3. Log in to the Jenkins web UI as admin.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  4. In the global view, find the CVP - Functional tests pipeline.

  5. Select the Build with Parameters option from the drop-down menu of the pipeline.

  6. Configure the following parameters as required:

    CVP - Functional tests parameters
    Parameter Description
    DEBUG_MODE If checked, keeps the container after the test is performed for debugging purposes.
    PROXY If an environment uses HTTP or HTTPS proxy, verify that you specify it in this field as this proxy address will be used to clone the required repositories and install the Python requirements. For the offline mode, specify offline; no additional packages or modules will be pulled from the Internet.
    SALT_MASTER_CREDENTIALS Specifies the credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines.
    SALT_MASTER_URL

    Specifies the reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      

    Caution

    In the 2019.2.4 update, by default, the HTTPS (SSL) NGINX URL is used as SALT_MASTER_URL. For long-running pipelines, it may lead to a timeout error, usually after 10 minutes. Therefore, specify the Salt API native non-SSL URL as described above to prevent timeout errors. Starting from the 2019.2.5 update, Salt API native non-SSL URL is used by default.

    SKIP_LIST_PATH Optional. Specifies the path to the tempest skip list file located in the repository specified in TOOLS_REPO. The default value is cvp-configuration/tempest/skip-list.yaml. For the offline mode, when cvp-rally image is used, define /var/lib/cvp-configuration/tempest/skip-list.yaml. If you use a custom image, specify the path to the custom skip list file.
    TARGET_NODE

    Specifies the node to run the container with Tempest/Rally. Use the Jenkins slave as it has the Docker package.

    Note

    Starting from the MCP 2019.2.2 update, if the TARGET_NODE parameter is empty, the node with the gerrit:client pillar will be used, which is cid01 by default.

    TEMPEST_ENDPOINT_TYPE Sets the type of the OpenStack endpoint to use during the test run.
    TEMPEST_REPO Specifies the Tempest repository to clone and use. By default, it is the upstream Tempest. However, you can specify your customized Tempest in a local or remote repository. For the full offline mode, specify /var/lib/tempest (a path inside a container) and the cvp-rally image.
    TEMPEST_TEST_PATTERN Specifies the tests to run. See the Rally documentation for all available options.
    TEST_IMAGE

    Specifies the link to the Docker Rally-based image to use for running the container with testing tools. We recommend using the upstream Rally image xrally/xrally-openstack:0.11.2.

    For the full offline mode, use the cvp-rally image from the local Docker images mirror (Registry) or pull docker-prod-local.docker.mirantis.net/mirantis/cvp/cvp-rally:<mcp_version> locally.

    TOOLS_REPO

    Includes the URL or path to the repository where testing tools, scenarios, and configurations are located. By default, it is https://github.com/Mirantis/cvp-configuration. Specify your MCP version here. For example, -b release/2019.2.0 for Q4`18.

    Caution

    For the Q4`18 MCP release, the branch name format is release/2019.2.0 for the cvp-configuration repository. The old 2019.2.0 format is deprecated.

    To customize the configuration, clone the cvp-configuration repository to your local Gerrit and commit the changes.

    For the offline mode, specify /var/lib/cvp-configuration/configure.sh. Alternatively, use this repository from the offline image. If your image is fully configured, leave the field empty.

  7. Click Build.

  8. Verify the job status:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors found.

    • YELLOW, UNSTABLE

      Some errors occurred during the test run. Proceed to Review the CVP - Functional tests pipeline results.

    • RED, FAILURE

      Testing has failed due to issues with the framework and/or the pipeline configuration. Review the console output.

Review the CVP - Functional tests pipeline results

This section explains how to review the CVP Functional tests trace logs from the Jenkins web UI.

To review the CVP Functional tests results:

  1. Log in to the Jenkins web UI.

  2. Navigate to the build that you want to review.

  3. Find Test Results at the bottom of the Build page.

  4. Scroll down and click Show all failed tests to view the complete list of the failed tests.

  5. Click on the test of concern for details.

    Note

    If + does not expand, use a different browser or a public URL for your Jenkins.

Add new tests to the CVP - Functional tests pipeline

This section includes the instruction on how to extend the predefined OpenStack functional integration tests.

To add custom functional integration tests:

  1. Clone the openstack/tempest Git repository to your local Gerrit repository.

  2. Check out the tempest repository from your local Gerrit.

  3. If required, create a new directory.

  4. Create a file for the new test or update the existing test file.

  5. Add the test code. Refer to Tempest Test Writing Guide in the OpenStack official documentation.

  6. Commit the changes to the local Gerrit.

  7. To perform the newly added test, run the CVP - Functional tests pipeline as described in Execute the CVP - Functional tests pipeline, specifying the TEMPEST_TEST_PATTERN parameter according to your test/test class/suite name.

    Note

    Verify that the TEMPEST_REPO value matches your local Gerrit copy of tempest.

Perform OpenStack performance (load) testing

MCP DriveTrain enables you to perform the Rally-based baseline performance testing of your MCP OpenStack deployment using the CVP - Performance tests Jenkins pipeline.

Note

We recommend using the Rally image v0.11.2.

If you need to customize Rally test scenarios, refer to the official Rally documentation.

Note

The pipeline is provided as is. It contains default settings, examples, and templates that may need adjustment for a deployed environment. Mirantis does not support the third-party components like Tempest or Rally used in the CVP tooling.

Install the Performance Jenkins plugin

Before you proceed, verify that the Performance Jenkins plugin is installed.

To install the Performance Jenkins plugin:

  1. Log in to the Jenkins web UI.
  2. Navigate to Manage Jenkins > Manage Plugins.
  3. Under the Available tab, search for the Performance plugin, and check it.
  4. Click Install without restart.

Execute the CVP - Performance tests pipeline

This section instructs you on how to perform the OpenStack performance (load) testing of your deployment using the CVP - Performance tests Jenkins pipeline.

Note

Clone the cvp-configuration repository to your local Gerrit and use it locally to add new Rally scenarios or adjust the existing ones.

To perform the OpenStack performance testing:

  1. In your local cvp-configuration repository, inspect and modify the following items as required:

    • The Rally scenarios (rally/rally_scenarios* files)
    • The configure.sh setup script
  2. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.

  3. Log in to the Jenkins web UI as admin.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  4. In the global view, find the CVP - Performance tests pipeline.

  5. Select the Build with Parameters option from the drop-down menu of the pipeline.

  6. Configure the following parameters as required:

    CVP - Performance tests parameters
    Parameter Description
    DEBUG_MODE If checked, keeps the container after the test is performed for debugging purposes.
    PROXY If an environment uses HTTP or HTTPS proxy, verify that you specify it in this field as this proxy address will be used to clone the required repositories and install the Python requirements. For the offline mode, specify offline; no additional packages or modules will be pulled from the Internet.
    RALLY_SCENARIO_FILE Specifies the path to the Rally scenarios file located in the repository specified in TOOLS_REPO. The default value is cvp-configuration/rally/rally_scenarios.json. For the offline mode, when the cvp-rally image is used, define /var/lib/cvp-configuration/rally/rally_scenarios.json. If you use a custom image, specify the path to the custom scenarios file. The cvp-configuration repository contains a default set of scenarios for a dry run in rally/rally_scenarios.json and the same set but with 100 iterations and 10 threads in rally/rally_scenarios_100.json. More advanced scenarios with floating IPs and live migration are available in rally/rally_scenarios_fip_and_ubuntu.json and rally/rally_scenarios_fip_and_ubuntu_100.json.
    SALT_MASTER_CREDENTIALS Specifies the credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines.
    SALT_MASTER_URL

    Specifies the reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      

    Caution

    In the 2019.2.4 update, by default, the HTTPS (SSL) NGINX URL is used as SALT_MASTER_URL. For long-running pipelines, it may lead to a timeout error, usually after 10 minutes. Therefore, specify the Salt API native non-SSL URL as described above to prevent timeout errors. Starting from the 2019.2.5 update, Salt API native non-SSL URL is used by default.

    TARGET_NODE

    Specifies the node to run the container with Tempest/Rally. Use the Jenkins slave as it has the Docker package.

    Note

    Starting from the MCP 2019.2.2 update, if the TARGET_NODE parameter is empty, the node with the gerrit:client pillar will be used, which is cid01 by default.

    TEST_IMAGE

    Specifies the link to the Docker Rally-based image to use for running the container with testing tools. We recommend using the upstream Rally image, xrally/xrally-openstack:0.11.2.

    For the full offline mode, use the cvp-rally image from the local Docker images mirror (Registry) or pull docker-prod-local.docker.mirantis.net/mirantis/cvp/cvp-rally:<mcp_version> locally.

    TOOLS_REPO

    Includes the URL or path to the repository where testing tools, scenarios, and configurations are located. By default, it is https://github.com/Mirantis/cvp-configuration. Specify your MCP version here. For example, -b release/2019.2.0 for Q4`18.

    Caution

    For the Q4`18 MCP release, the branch name format is release/2019.2.0 for the cvp-configuration repository. The old 2019.2.0 format is deprecated.

    To customize the configuration, clone the cvp-configuration repository to your local Gerrit and commit the changes.

    For the offline mode, specify /var/lib/cvp-configuration/configure.sh. Alternatively, use this repository from the offline image.

    If your image is fully configured, leave the field empty.

  7. Click Build.

  8. Verify the job status:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors found.

    • YELLOW, UNSTABLE

      Some errors occurred during the test run. Proceed to Review the CVP - Performance pipeline tests results.

    • RED, FAILURE

      Testing has failed due to issues with the framework and/or the pipeline configuration. Review the console output.

Review the CVP - Performance pipeline tests results

This section explains how to review the CVP Performance tests trace logs from the Jenkins web UI.

To review the CVP - Performance pipeline tests results:

  1. Log in to the Jenkins web UI.

  2. Navigate to the build that you want to review.

  3. Find Test Results at the bottom of the Build page.

  4. Scroll down and click Show all failed tests to view the complete list of the failed tests.

  5. Click on the test of concern for details.

    Note

    If + does not expand, use a different browser or a public URL for your Jenkins.

Perform the OpenStack high availability testing

The OpenStack high availability testing is aimed at non-functional failover testing of the OpenStack Virtualized Control Plane (VCP) nodes, such as the OpenStack controller, database, and networking nodes. You can also check other nodes, for example, the StackLight nodes.

You can perform the high availability testing of your MCP OpenStack deployment using the CVP - HA tests Jenkins pipeline.

Note

The pipeline is provided as is. It contains default settings, examples, and templates that may need adjustment for a deployed environment. Mirantis does not support the third-party components like Tempest or Rally used in the CVP tooling.

Execute the CVP - HA tests pipeline

This section instructs you on how to perform non-functional failover testing of OpenStack nodes that include but are not limited to the ctl, ntw, and dbs virtual machines using the CVP - HA tests Jenkins pipeline.

Note

Clone the cvp-configuration and tempest repositories to your local Gerrit and use them locally to add new or adjust the existing tests and fine-tune the Tempest configuration.

To perform the non-functional failover testing of your OpenStack deployment:

  1. In your local cvp-configuration repository, inspect and modify the following items as required:

    • The Tempest configuration

    • The skip list

    • The configure.sh setup script

      Note

      We recommend running the CVP - HA tests Jenkins pipeline after the functional/integration testing is completed and all issues are resolved.

  2. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.

  3. Log in to the Jenkins web UI as admin.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  4. In the global view, find the CVP - HA tests pipeline.

  5. Select the Build with Parameters option from the drop-down menu of the pipeline.

  6. Configure the following parameters as required:

    CVP - HA tests parameters
    Parameter Description
    DEBUG_MODE If checked, keeps the container after the test is performed for debugging purposes.
    MANUAL_CONFIRMATION If checked, you will be asked for a confirmation before any destructive actions such as a node reboot or shutdown.
    PROXY If an environment uses an HTTP or HTTPS proxy, specify it in this field, as this proxy address will be used to clone the required repositories and install the Python requirements. For the offline mode, specify offline.
    RETRY_CHECK_STATUS Specifies the number of retries to check the node status. If you have any issues with timeouts, increase the default 200 value.
    SALT_MASTER_CREDENTIALS Specifies the credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines.
    SALT_MASTER_URL

    Specifies the reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      

    Caution

    In the 2019.2.4 update, by default, the HTTPS (SSL) NGINX URL is used as SALT_MASTER_URL. For long-running pipelines, it may lead to a timeout error, usually after 10 minutes. Therefore, specify the Salt API native non-SSL URL as described above to prevent timeout errors. Starting from the 2019.2.5 update, Salt API native non-SSL URL is used by default.

    SKIP_LIST_PATH Optional. Specifies the path to the tempest skip list file located in the repository specified in TOOLS_REPO. The default value is cvp-configuration/tempest/skip-list.yaml. For the offline mode, when cvp-rally image is used, define /var/lib/cvp-configuration/tempest/skip-list.yaml. If you use a custom image, specify the path to the custom skip list file.
    TARGET_NODES Specifies OpenStack control plane nodes that will be under the HA test. For example, ctl* will include all nodes that have ctl at the beginning of their names (usually controllers).
    TEMPEST_REPO Specifies the Tempest repository to clone and use. By default, it is the upstream Tempest. However, you can specify your customized Tempest in a local or remote repository. For the full offline mode, specify /var/lib/tempest (a path inside a container) and the cvp-rally image.
    TEMPEST_TARGET_NODE

    Specifies the node to run the container with Tempest/Rally. Use the Jenkins slave as it has the Docker package.

    Note

    Starting from the MCP 2019.2.2 update, if the TARGET_NODE parameter is empty, the node with the gerrit:client pillar will be used, which is cid01 by default.

    TEMPEST_TEST_PATTERN Specifies the tests to run. See the Rally documentation for all available options.
    TEST_IMAGE

    Specifies the link to the Docker Rally-based image to use for running the container with testing tools. We recommend using the upstream Rally image, xrally/xrally-openstack:0.11.2.

    For the full offline mode, use the cvp-rally image from the local Docker images mirror (Registry) or pull docker-prod-local.docker.mirantis.net/mirantis/cvp/cvp-rally:<mcp_version> locally.

    TOOLS_REPO

    Includes the URL or path to the repository where testing tools, scenarios, and configurations are located. By default, it is https://github.com/Mirantis/cvp-configuration. Specify your MCP version here. For example, -b release/2019.2.0 for Q4`18.

    Caution

    For the Q4`18 MCP release, the branch name format is release/2019.2.0 for the cvp-configuration repository. The old 2019.2.0 format is deprecated.

    To customize the configuration, clone the cvp-configuration repository to your local Gerrit and commit the changes.

    For the offline mode, specify /var/lib/cvp-configuration/configure.sh. Alternatively, use this repository from the offline image.

    If your image is fully configured, leave the field empty.

  7. Click Build.

  8. Verify the job stages statuses in the Stage view section:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors were found.

    • RED, FAILURE

      The Tempest run has failed during one of the job stages, causing the pipeline to abort. Review the console output.

Add new tests to the CVP - HA tests pipeline

This section includes the instruction on how to extend the predefined OpenStack high availability tests.

To add custom OpenStack high availability tests:

  1. Clone the openstack/tempest Git repository to your local Gerrit repository.

  2. Check out the tempest repository from your local Gerrit.

  3. If required, create a new directory.

  4. Create a file for the new test or update the existing test file.

  5. Add the test code. Refer to the Tempest Test Writing Guide in the OpenStack official documentation. A sample Gerrit workflow is shown after this procedure.

  6. Commit the changes to the local Gerrit.

  7. To perform the newly added test, run the CVP - HA tests pipeline as described in Execute the CVP - HA tests pipeline, specifying the TEMPEST_TEST_PATTERN parameter according to your test/test class/suite name.

    Note

    Verify that the TEMPEST_REPO value matches your local Gerrit copy of tempest.
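
The Gerrit workflow for steps 1 through 6 may look as follows. This is a hedged sketch: the Gerrit host, project path, directory, and file names are hypothetical and must be adapted to your local Gerrit setup:

# Hypothetical project path and file names; adjust to your environment.
git clone ssh://<user>@<gerrit_host>:29418/tempest
cd tempest
mkdir -p tempest/scenario/custom
# Create the new test file and add the test code
# (refer to the Tempest Test Writing Guide).
touch tempest/scenario/custom/test_custom_failover.py
git add tempest/scenario/custom/test_custom_failover.py
git commit -m "Add a custom HA failover test"
# Standard Gerrit review push; adjust the target branch as required.
git push origin HEAD:refs/for/master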

Verify StackLight LMA

StackLight LMA testing through the Jenkins web UI performs basic verification of your StackLight LMA deployment, helping you troubleshoot the StackLight LMA cluster health.

You can perform the StackLight LMA verification using the CVP - StackLight tests pipeline.

Note

The pipeline is provided as is. It contains a default set of tests that may need adjustment for a deployed environment.

Execute the CVP - StackLight tests pipeline

This section instructs you on how to perform the StackLight LMA verification using the CVP - StackLight tests Jenkins pipeline.

Note

For MCP versions prior to the 2019.2.4 update, clone the release/2019.2.0 branch of the stacklight-pytest repository to your local Gerrit and use it locally to add new or adjust the existing tests.

To perform the StackLight LMA verification:

  1. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.

  2. Log in to the Jenkins web UI as admin.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  3. In the global view, find the CVP - StackLight tests pipeline.

  4. Select the Build with Parameters option from the drop-down menu of the pipeline.

  5. Configure the following parameters as required:

    CVP - StackLight tests parameters
    Parameter Description
    PROXY Removed since 2019.2.4 update If an environment uses an HTTP or HTTPS proxy, specify it in this field, as this proxy address will be used to clone the required repositories and install the Python requirements.
    SALT_MASTER_CREDENTIALS Specifies the credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines.
    SALT_MASTER_URL

    Specifies the reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      
    TESTS_REPO Removed since 2019.2.4 update Specifies the repository with the StackLight tests, which can be either a GitHub URL or an internal Gerrit repository with custom tests. By default, it is http://gerrit.mcp.mirantis.com/mcp/stacklight-pytest -b release/2019.2.0 that clones the release/2019.2.0 branch of the stacklight-pytest repository.
    TESTS_SET Removed since 2019.2.4 update

    Specifies the name of the test to perform or directory to discover tests.

    By default, it is stacklight-pytest/stacklight_tests/tests/. Leave the field as is for a full test run or specify a test file name. For example, <default_path>/test_logs.py runs only the test for Kibana.

    TESTS_SETTINGS Removed since 2019.2.4 update

    Includes additional environment variables that can be passed to the test framework to override the default configuration.

    Always specify the SL_AUTOCONF=True and PYTHONPATH="./stacklight-pytest" options. Use a semicolon to separate variables. A sample value is shown after this procedure.

    For example, the skipped_nodes=mtr01.local,log02.local string forces the pipeline to skip the mtr01 and log02 nodes.

    If you have UPG nodes in your deployment, add them to the skipped_nodes list. For example, skipped_nodes=upg01.<domain_name>.

    EXTRA_PARAMS Added since 2019.2.4

    Includes additional environment variables that can be passed to the test framework to override the default configuration. Always specify - SL_AUTOCONF=True.

    Added since 2019.2.5 Use the force_pull parameter to enable or disable the pull operation for an image before running the container. Set force_pull=false if image pulling is impossible or not required. In this case, the image may be manually uploaded to the target node.

  6. Click Build.

  7. Verify the job status:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors were found.

    • YELLOW, UNSTABLE

      Some errors occurred during the test run. Proceed to Review the CVP - StackLight pipeline tests results.

    • RED, FAILURE

      Testing has failed due to issues with the framework and/or the pipeline configuration. Review the console output.
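
For reference, a TESTS_SETTINGS value for a pre-2019.2.4 run might look similar to the following. The node names are illustrative only and must be replaced with the names used in your deployment:

SL_AUTOCONF=True;PYTHONPATH="./stacklight-pytest";skipped_nodes=mtr01.example.local,upg01.example.local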

Review the CVP - StackLight pipeline tests results

This section explains how to review the CVP - StackLight tests trace logs from the Jenkins web UI.

To review the CVP - StackLight tests results:

  1. Log in to the Jenkins web UI.

  2. Navigate to the build that you want to review.

  3. Find Test Results at the bottom of the Build page.

  4. Scroll down and click Show all failed tests to view the complete list of the failed tests.

  5. Click on the test of concern for details.

    Note

    A test name corresponds to the test file path in the test source repository. For example, stacklight_tests.tests.prometheus.test_dashboards.

  6. Review the lines beginning with the E letter in Stack trace.

    Note

    If + does not expand, use a different browser or a public URL for your Jenkins.

Add new tests to the CVP - StackLight tests pipeline

This section includes the instruction on how to extend the predefined StackLight LMA tests.

Note

Starting from the MCP 2019.2.4 maintenance update, the CVP - StackLight tests pipeline does not depend on the external repository; therefore, adding custom StackLight LMA tests is deprecated.

To add custom StackLight LMA tests prior to the MCP 2019.2.4 update:

  1. Clone the Mirantis/stacklight-pytest Git repository to a local Gerrit repository.

  2. Check out the stacklight-pytest repository from your local Gerrit.

  3. If required, create a new directory under stacklight-pytest/stacklight_tests/tests/.

  4. Create the test_<test_file_name>.py file, where the test_ prefix is mandatory (see the example after this procedure).

  5. Add the test code.

  6. Commit your changes to the local Gerrit.

  7. To perform the newly added test, run the CVP - StackLight tests pipeline as described in Execute the CVP - StackLight tests pipeline, specifying <default_path>/<new_folder_name>/test_<test_file_name> in the TESTS_SET field.

    Note

    Verify that the TESTS_SET value matches your local Gerrit copy of the stacklight-pytest repository.
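
As an illustration of steps 3 and 4, a hypothetical new test directory and file might be created as follows (applies to MCP versions prior to the 2019.2.4 update; the names are examples only):

mkdir -p stacklight-pytest/stacklight_tests/tests/custom
# The test_ prefix in the file name is mandatory.
touch stacklight-pytest/stacklight_tests/tests/custom/test_custom_alerts.py

The corresponding TESTS_SET value would then be stacklight-pytest/stacklight_tests/tests/custom/test_custom_alerts.py.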

Perform the data plane networking testing with CVP Shaker

CVP Shaker verifies and measures the performance of the data plane networking of your MCP OpenStack deployment. This test suite is based on Shaker, which is a wrapper around popular system network testing tools such as iperf, iperf3, and netperf.

To perform the OpenStack data plane networking test of your environment, use the CVP - Shaker network tests Jenkins pipeline.

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Execute the CVP - Shaker network tests pipeline

This section instructs you on how to perform the data plane networking test of your deployment using the CVP - Shaker network tests Jenkins pipeline.

Note

Clone the cvp-shaker repository to your local Gerrit and use it locally to add new or adjust the existing scenarios and build a new Docker image with Shaker.

To perform the data plane networking test of your deployment:

  1. In a web browser, open http://<ip_address>:8081 to access the Jenkins web UI.

    Note

    The IP address is defined in the classes/cluster/<cluster_name>/cicd/init.yml file of the Reclass model under the cicd_control_address parameter.

  2. Log in to the Jenkins web UI as Administrator.

    Note

    To get the password, execute the following command on the Salt Master node:

    salt-call pillar.data _param:jenkins_admin_password
    
  3. In the global view, find the CVP - Shaker network tests pipeline.

  4. Select the Build with Parameters option from the drop-down menu of the pipeline.

  5. Configure the following parameters as required:

    CVP - Shaker network tests parameters
    Parameter Description
    IMAGE The cvp-shaker Docker image (with all dependencies) that will be used during the test run. The default value is docker-prod-local.docker.mirantis.net/mirantis/cvp/cvp-shaker:<MCP_VERSION>. For the offline mode, use the URL from the local artifactory or offline image.
    SALT_MASTER_CREDENTIALS The credentials to Salt API stored in Jenkins, included by default. See View credentials details used in Jenkins pipelines for details.
    SALT_MASTER_URL

    The reachable IP address of the Salt Master node and port on which Salt API listens. For example, http://172.18.170.28:6969.

    To determine on which port Salt API listens:

    1. Log in to the Salt Master node.

    2. Search for the port in the /etc/salt/master.d/_api.conf file.

    3. Verify that the Salt Master node is listening on that port:

      netstat -tunelp | grep <PORT>
      
    SHAKER_PARAMS The YAML context with parameters for running Shaker. See the description of the available options below; a sample context is also shown after this procedure. The SHAKER_SERVER_ENDPOINT option is mandatory, while the others can be left with the default values.
    SHAKER_PARAMS:SHAKER_SERVER_ENDPOINT

    The address for the Shaker server connections in the form of host:port. The address should be accessible from the OpenStack public (floating) network and usually equals a publicly routable address of the CI/CD node. This is a mandatory option.

    Caution

    The Shaker server address should belong to the CI/CD node on which you start the job and must be accessible from the OpenStack public (floating) network. Also, to be able to run some scenarios, provide the compute nodes with 4*(density count) GB of free disk space.

    SHAKER_PARAMS:SHAKER_SCENARIOS

    Path to the Shaker scenarios in the cvp-shaker Docker image. Can be either a directory or a specific file. The main categories include:

    • scenarios/essential/l2
    • scenarios/essential/l3
    • scenarios/additional/cross_az
    • scenarios/additional/external
    • scenarios/additional/qos

    The default value is scenarios/essential that starts a comprehensive test of both L2 and L3 data plane networking.

    SHAKER_PARAMS:SKIP_LIST A comma-separated list of Shaker scenarios to skip, specified as directories or files inside the scenarios directory of the cvp-shaker Docker image. For example, dense_l2.yaml,full_l2.yaml,l3. Defaults to an empty string.
    SHAKER_PARAMS:MATRIX

    The matrix of extra parameters for the scenario. The value is specified in the JSON format. Defaults to an empty string.

    For example, to override a scenario duration, specify {time: 10}, or to override the list of hosts, define {host:[ping.online.net, iperf.eenet.ee]}. If several parameters are overridden, all combinations are tested.

    It is a required field for some external-category scenarios when the host name of an external iPerf3 server must be provided as a parameter, for example, {host: 10.13.100.4}.

    SHAKER_PARAMS:IMAGE_BUILDER

    The list of the shaker-image-builder environment variables that includes:

    • SHAKER_FLAVOR_DISK
    • SHAKER_FLAVOR_RAM
    • SHAKER_FLAVOR_VCPUS
    • SHAKER_IMAGE_BUILDER_MODE

    These variables are used for building the Shaker image, which will be used for running Shaker agents across the cluster. Leave these parameters commented out, as the default settings should meet all the requirements for starting a test.

    SHAKER_PARAMS:SHAKER

    List of Shaker server environment variables including:

    • SHAKER_AGENT_JOIN_TIMEOUT
    • SHAKER_AGENT_LOSS_TIMEOUT
    • SCENARIO_AVAILABILITY_ZONE
    • SCENARIO_COMPUTE_NODES
    • SHAKER_EXTERNAL_NET

    You can define these parameters to alter the Shaker server environment variables.

    The SHAKER_EXTERNAL_NET variable should be set to the name of your OpenStack floating network, which defaults to public.

    The SCENARIO_AVAILABILITY_ZONE variable can be used when running the cross az scenarios category to override default availability zones set in the scenarios.

    All the other options in most cases can be left unchanged.

  6. Click Build.

  7. Verify the job status:

    • GREEN, SUCCESS

      Testing has been performed successfully; no errors were found.

    • RED, FAILURE

      Testing has failed due to issues with the framework and/or the pipeline configuration. Review the console output.
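
For reference, a SHAKER_PARAMS context that combines the options described above might look as follows. The structure mirrors the parameter names listed above, and all values are illustrative only; adapt them to your environment:

SHAKER_SERVER_ENDPOINT: "10.0.0.10:5999"   # hypothetical publicly routable CI/CD node address and port
SHAKER_SCENARIOS: "scenarios/essential"    # the default comprehensive L2/L3 test set
SKIP_LIST: "dense_l2.yaml,full_l2.yaml"    # example skip list
MATRIX: "{time: 10}"                       # example scenario duration override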

Review the CVP Shaker network test results

This section explains how to review the CVP - Shaker network tests trace logs and results from the Jenkins web UI.

To review the CVP Shaker test results:

  1. Log in to the Jenkins web UI.

  2. Navigate to the build that you want to review.

  3. Find the shaker-report.html file at the top of the Build page.

  4. Download the report and open it.

  5. Click on the scenario of concern for details. A scenario name corresponds to the scenario file path in the cvp-shaker source repository. For example, the OpenStack L3 East-West scenario name corresponds to cvp-shaker/scenarios/essential/l3/full_l3_east_west.yaml

  6. Review the performance graphs or errors that appeared during the testing.

    For example:

    _images/cvp-shaker-example.png
  7. (Optional) To view the log messages produced during the testing by Shaker, inspect shaker.log at the bottom of the Build page.

Troubleshooting

This section provides information on troubleshooting and known issues.

Generate a sosreport

Sosreport is an extensible and portable support data collection tool that creates diagnostic snapshots of the system, including the system log files and configuration details. Starting from the MCP 2019.2.7 maintenance update, the tool works out of the box, does not require any mandatory configuration, and enables you to archive the obtained data and attach the archive to a Salesforce case.

Generate a sosreport prior to 2019.2.7

This section describes how to generate a sosreport for MCP versions prior to the MCP 2019.2.7 maintenance update. In this case, the sosreport tool works without Salt orchestration.

To generate a sosreport:

  1. Log in to the Salt Master node.

  2. Download the latest available sosreport package for the required nodes. For example:

    salt -C '<target>' cmd.run 'curl -o sosreport.deb http://mirror.mirantis.com/update/2019.2.0/extra/xenial/pool/main/s/sosreport/sosreport_<version>.deb  && dpkg -i sosreport.deb'
    
  3. Run the sosreport command on the target nodes to generate a report for the entire system. For a list of possible options, run sosreport --help.
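
For example, a non-interactive run on the target nodes may look as follows. The --batch and --tmp-dir options are standard sosreport options; the target expression and directory are examples only:

salt -C '<target>' cmd.run 'sosreport --batch --tmp-dir /root/reportdir'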

Generate a sosreport starting from 2019.2.7

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to generate a sosreport starting from the MCP 2019.2.7 maintenance update. The tool does not require mandatory configuration. However, you can extend the sosreport tool as required.

To generate a sosreport:

  1. Optional. Recommended. Configure the sosreport tool using one of the following options:

    • Configure the sosreport tool through the Reclass model:

      1. To save the configuration for all nodes, specify the following pillar in the cluster/<cluster_name>/infra/init.yml file. Otherwise, add the pillar to a particular node or the common component file.

        linux:
          system:
            sosreport:
              cmd_options:
                tmp-dir: /root/reportdir
                no_arg_opts: [ '-q' ]
              config_options:
                general:
                  all-logs: true
                plugins:
                  disabled: [ docker ]
                tunables:
                  apache.log: true
        

        Parameters description:

        • cmd_options - defines additional arguments for a CLI and cmd call.
        • general - includes the parameters for the sos.conf general section.
        • plugins - defines the enabled or disabled plugins.
        • tunables - defines custom plugin options.
      2. Apply the changes from the Salt Master node:

        salt -C <target> saltutil.sync_all
        
    • Configure the sosreport tool by overriding the default settings from the Salt Master node. For example:

      salt -C '<target>' state.sls linux.system.sosreport.report pillar='{ \
      "sosreport" : { "ticket-number": 12345, "tmp-dir": "/root/reportdir2" } }'
      

      For a list of possible options, run sosreport --help.

  2. Select from the following options:

    • To generate a sosreport on one or multiple target nodes:

      1. Log in to the Salt Master node.

      2. Run the following command:

        salt -C '<target>' state.sls linux.system.sosreport.report
        
    • To generate a sosreport on a particular node:

      1. Log in to the required node.

      2. Run the following command:

        salt-call state.sls linux.system.sosreport.report
        

Now, proceed to Collect and archive the reports.

Collect and archive the reports

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Once you obtain the data from the required nodes as described in Generate a sosreport starting from 2019.2.7, create a common archive with the obtained data and attach it to a Salesforce case.

To collect and archive the reports:

  1. Log in to the Salt Master node.

  2. Run the following command specifying the options as required:

    salt -C 'I@salt:master' state.sls linux.system.sosreport.collect pillar='{ \
    "sosreport_collect" : { "target": "<target>", "archiveName": "sosreport_<env_name>_<customer_name>_<SF_ticket_ID>" } }'
    

    Additionally, you can specify the following options:

    • nodeIp - to use an IP from another interface on the node. The IP must be available from the Salt minions.
    • port - to use NetCat in case the default 31337 port is busy.
    • reportWorkDir - to specify the directory to keep all reports for a specific case.

    As a result, the tool creates one common archive named sosreport_<env_name>_<customer>_<ticket>.tar.gz for all <target> nodes based on the parameters set through the model or pillar override and attaches it to the specified Salesforce ticket.
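
For example, a hypothetical invocation that uses the optional parameters may look as follows. The target, archive name, port, and working directory are illustrative only:

salt -C 'I@salt:master' state.sls linux.system.sosreport.collect pillar='{ \
"sosreport_collect" : { "target": "ctl*", "archiveName": "sosreport_lab_acme_00123456", \
"port": 31338, "reportWorkDir": "/srv/volumes/sosreports" } }'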

Troubleshooting

When collecting logs using the sosreport tool, the No space left on device while collecting plugin data exception may occur. Such an exception occurs if the logs are large and indicates that the device hosting the temporary directory has no free space available. The temporary directory stores system reports before archiving and is set to /tmp by default.

To avoid the exception, manually change the directory through the model or CLI as described in step 1 of Generate a sosreport starting from 2019.2.7. Additionally, you can enable a verbose output and interactive debugging using the Python debugger. For details, run sosreport --help.

Troubleshoot DriveTrain

This section instructs you on how to troubleshoot the DriveTrain issues.

Glusterd service restart failure

[32334] The Glusterd service does not restart automatically after its child processes fail or are unexpectedly killed.

To troubleshoot the issue:

  1. Log in to a KVM node.

  2. In the /lib/systemd/system/glusterd.service file, set the Restart option in the [Service] section:

    [Service]
    ...
    Restart=on-abort
    ...
    

    The recommended values include:

    • on-abort

      The service restarts only if the service process exits due to an uncaught signal not specified as a clean exit status.

    • on-failure

      The service restarts when the process exits with a non-zero exit code, is terminated by a signal (including on core dump but excluding the clean exit signals), when an operation such as a service reload times out, or when the configured watchdog timeout is triggered.

  3. Apply the changes:

    systemctl daemon-reload
    

    Note

    Re-apply the provided workaround if any of the GlusterFS packages has been re-installed or upgraded.

  4. Perform the steps above on the remaining KVM nodes.
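
To apply the same change on all KVM nodes at once, you can use a one-liner similar to the following. This is a hedged sketch that assumes GNU sed, the kvm* Salt target, and that the Restart option is not already set; review and adjust it before use:

salt 'kvm*' cmd.run "grep -q '^Restart=' /lib/systemd/system/glusterd.service || \
  sed -i '/^\[Service\]/a Restart=on-abort' /lib/systemd/system/glusterd.service; \
  systemctl daemon-reload"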

Failure to connect to Jenkins master

[34848] Jenkins slaves may be unable to connect to Jenkins master during the update of MCP versions prior to 2019.2.4 to MCP maintenance update 2019.2.8 or newer. The issue may be related to the Cross Site Request Forgery (CSRF) protection configuration.

To verify whether your deployment is affected:

  1. From the Salt Master node, obtain the Jenkins credentials:

    salt -C 'I@jenkins:client and not I@salt:master' config.get jenkins
    
  2. Run the following command:

    curl -u <jenkins_admin_user>:<jenkins_admin_password> '<jenkins_url>//crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)'
    

    If the system response is 404 Not found, proceed with the issue resolution below.

To apply the issue resolution:

  1. Log in to the Jenkins web UI.
  2. Navigate to Manage Jenkins > Configure Global Security.
  3. Under CSRF Protection, select Prevent Cross Site Request Forgery exploits and Default Crumb Issuer.
  4. Click Save.

Once done, Jenkins slaves automatically reconnect in a few seconds.

NGINX gateway timeout during the DriveTrain upgrade

[34798] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the Error with request: HTTP Error 504: Gateway Time-out error message. The issue may occur in huge environments when applying Salt states on all nodes due to a small NGINX timeout configured on the Salt Master node.

To apply the issue resolution:

Select one of the following options:

  • In the Deploy - upgrade MCP DriveTrain Jenkins pipeline job, set the SALT_MASTER_URL parameter to the Salt API endpoint http://<cfg_node_ip>:6969.

  • Increase the NGINX timeout:

    1. Open your Git project repository with the Reclass model on the cluster level.

    2. In classes/cluster/<cluster_name>/infra/config/init.yml, increase the NGINX timeout for the Salt API site. By default, the timeout is set to 600 seconds.

      parameters:
        nginx:
          server:
            site:
              nginx_proxy_salt_api:
                proxy:
                  timeout: <timeout>
      
    3. Refresh Salt pillars and apply the nginx state on the Salt Master node:

      salt -C 'I@salt:master' saltutil.refresh_pillar
      salt -C 'I@salt:master' state.apply nginx
      
    4. Commit the changes to your local repository.

SaltReqTimeoutError during the DriveTrain upgrade

[34114] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the SaltReqTimeoutError in master zmq thread error message when executing the salt.minion state on several minions at the same time. Adjust the Salt Master configuration to improve its performance.

To adjust the Salt Master configuration:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster_name>/infra/config/init.yml, increase the values for the following parameters as required:

    gather_job_timeout

    The number of seconds to wait for the client to request information about the running jobs.

    worker_threads

    The number of threads to start to receive commands and replies from minions.

    sock_pool_size

    The pool size of Unix sockets. To avoid blocked waiting while writing data to a socket, a socket pool is supported for Salt applications. For example, a job or state with a large target host list can cause a long period of blocked waiting.

    zmq_backlog

    The number of messages in the ZeroMQ backlog queue.

    For example:

    parameters:
      salt:
        master:
          worker_threads: 40
          opts:
            gather_job_timeout: 100
            sock_pool_size: 15
            zmq_backlog: 3000
    
  3. Log in to the Salt Master node.

  4. Refresh Salt pillars and apply the salt.master state.

    salt-call saltutil.refresh_pillar
    salt-call state.apply salt.master
    
  5. Commit the changes to your local repository.

Troubleshoot an MCP OpenStack environment

This section describes procedures that help troubleshoot problems that may occur in an MCP OpenStack environment, for example, after restarting the cloud environment due to a power outage.

If your MCP OpenStack environment is OpenContrail-based, refer to Troubleshoot OpenContrail to troubleshoot networking.

Troubleshoot the system

This section describes how to verify the Linux system status.

To perform the Linux status check:

  1. Log in to the OpenStack controller node that you want to troubleshoot.

  2. Verify that there is enough space on the disk:

    df -h
    

    Example system response:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1        20G  4.5G   15G  25% /
    tmpfs           939M  328K  939M   1% /dev/shm
    /dev/sdd1       3.9G   11M  3.7G   1% /tmp
    /dev/sdc1       9.8G  102M  9.2G   2% /var/log
    /dev/sde        3.9G   34M  3.6G   1% /var/log/audit
    tmpfs           512M     0  512M   0% /config
    
  3. Check the available inodes:

    df -i
    

    Example system response:

    Filesystem       Inodes  IUsed    IFree IUse% Mounted on
    /dev/sda1      14589952 387481 14202471    3% /
    none            2038052     13  2038039    1% /sys/fs/cgroup
    udev            2035377    518  2034859    1% /dev
    tmpfs           2038052    591  2037461    1% /run
    none            2038052      5  2038047    1% /run/lock
    none            2038052    118  2037934    1% /run/shm
    none            2038052     35  2038017    1% /run/user
    
  4. Verify that there is enough memory:

    free -m
    

    Example system response:

                 total       used       free     shared    buffers     cached
    Mem:       1922208    1839352      82856        500     415068     525944
    -/+ buffers/cache:     898340    1023868
    Swap:      4193276       1544    4191732
    
  5. Verify that no processes use all of the available memory or CPU:

    htop
    

    Example system response:

    top - 11:46:28 up 40 days, 22:35,  3 users,  load average: 0.00, 0.00, 0.03
    Tasks: 218 total,   1 running, 217 sleeping,   0 stopped,   0 zombie
    Cpu(s):  2.8%us,  1.5%sy,  0.0%ni, 95.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   1922208k total,  1829060k used,    93148k free,   415108k buffers
    Swap:  4193276k total,     1544k used,  4191732k free,   526056k cached
    

Troubleshoot the OpenStack services

Depending on your needs, inspect the OpenStack services logging information:

  • On the OpenStack controller nodes:

    tail -fn0 /var/log/{nova,cinder,glance,keystone,neutron,contrail,cassandra,rabbitmq}/*.log \
    | egrep 'ERROR|WARNING|REFUSED|EXCEPTION|TRACE|error'
    
  • On the OpenStack compute nodes:

    tail -fn0 /var/log/{nova,contrail}/*.log | \
    egrep 'ERROR|TRACE|WARNING|REFUSED|EXCEPTION|error|SHUTDOWN'
    

To ensure that an OpenStack service is up and running, verify the service status on every controller node. Some OpenStack services require additional verification on the OpenStack non-controller nodes.

Before you proceed with troubleshooting, export all credentials from the keystonerc file on each controller node to be able to manage your OpenStack environment.
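
For example, a minimal sketch assuming the credentials file is stored as /root/keystonercv3 on the controller nodes (adjust the path to your deployment):

source /root/keystonercv3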

Verifying the OpenStack services status
Service name Verification procedure
Glance Use the glance image-list command. The output should contain the table with the images list. The images status should be active.
Nova Use the nova service-list command. The output should contain the table with the Nova services list. The services status should be enabled, and their state should be up.
Cinder Use the cinder-manage service list command. The output should contain the table with the Cinder services list. The services status should be enabled.

Troubleshoot RabbitMQ

To troubleshoot RabbitMQ:

  1. Log in to any OpenStack messaging node.

  2. Verify the RabbitMQ cluster status:

    rabbitmqctl cluster_status
    
  3. Verify the AMQP messaging RabbitMQ status:

    curl -s -o rabbitmqadmin http://$(hostname -s):15672/cli/rabbitmqadmin;
    chmod 0755 ./rabbitmqadmin
    adm_creds=$(salt-call pillar.get rabbitmq:server:admin --out=yaml \
      | egrep ':\ (.+)' | tr '\n' ' ' | sed 's/name: /-u /; s/password: /-p /')
    ./rabbitmqadmin -H $(hostname -s) $adm_creds show overview
    ./rabbitmqadmin -H $(hostname -s) $adm_creds list queues
    ./rabbitmqadmin -H $(hostname -s) $adm_creds list queues vhost name node messages \
      message_stats.publish_details.rate | grep -v " 0"
    ./rabbitmqadmin -H $(hostname -s) $adm_creds -V /openstack list queues vhost name \
    node messages message_stats.publish_details.rate | grep -v " 0"
    
  4. List the number of messages and consumers for each queue. Select the vhost; otherwise, you will work with the default vhost, which is /. The number of messages should be low, and the number of consumers should be above 0:

    rabbitmqctl list_queues -p /openstack messages consumers name
    
  5. Print stuck nodes if any:

    rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
    

Troubleshoot RabbitMQ non-functional queue bindings

Caution

The following procedure has been tested with clustered RabbitMQ series 3.8. For the RabbitMQ series 3.6, as well as for a nonclustered RabbitMQ, you may need to adjust the procedure.

Note

This feature is available starting from the MCP 2019.2.15 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

In case of network hiccups or instability, a RabbitMQ cluster may not pass messages between some exchanges and their respective queues even if the corresponding bindings are present and visible in the RabbitMQ Management Plugin UI. Due to concurrent handling of self-healing actions and OpenStack clients recreating queues during reconnection, some bindings that route messages from exchanges to queues may be created dysfunctional. This issue also applies to queue policies. For example, queues can be created without the needed policies, which can lead to an unexpected behavior.

This section describes how to identify and fix dysfunctional bindings. The solution includes a helper script that sends test messages to exchanges with the identified bindings. If a message cannot be delivered with a particular binding, the script prints out a message about a potentially dysfunctional binding.

Warning

Perform the following procedure only if the RabbitMQ cluster is healthy, the self-healing is over, and cluster_status indicates that the cluster is running all the nodes it was configured with. To verify the RabbitMQ cluster health, run the following command:

rabbitmqctl cluster_status

If the output indicates that the cluster is running fewer nodes than it was configured with and shows active partitions that will not be dismissed, or if the command errors out showing that the RabbitMQ process is not running, proceed with Restart RabbitMQ with clearing the Mnesia database instead.

To identify and fix RabbitMQ dysfunctional queues:

  1. Log in to the msg01 node.

  2. Download the script:

    wget "https://review.fuel-infra.org/gitweb?p=tools/sustaining.git;a=blob_plain;f=scripts/rabbitmq_check.py;hb=HEAD" -O ./rabbitmq_check.py
    
  3. Obtain the RabbitMQ admin credentials:

    salt-call -l quiet pillar.items rabbitmq:server:admin
    
  4. Obtain the IP address where the RabbitMQ Management Plugin is listening for incoming connections:

    ss -lnt '( sport = :15672 )'
    
  5. Run the script replacing %IP% and %PASSWORD% with the previously obtained parameters:

    python3 ./rabbitmq_check.py -u admin -p %PASSWORD% -H %IP% -V "/openstack" check
    

    If the script returns output with Possibly unreachable binding […] …, discrepancies have been found. Proceed with the steps below.

  6. Restart the service that runs the particular exchange-queue pair. For example, if a conductor binding is faulty, restart the nova-conductor service on all OpenStack controller nodes.

  7. If the service restart did not fix the issue, drop the state of exchanges and queues:

    1. Log in to the Salt Master node.

    2. Disconnect the clients from RabbitMQ to remove the interference with inter-cluster operations:

      salt -C 'I@rabbitmq:server' cmd.run 'iptables -I INPUT -p tcp --dport 5671:5672 -j REJECT'
      
    3. Stop the RabbitMQ application without shutting down the entire Erlang machine:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl stop_app'
      
    4. Wait for 1-2 minutes and then start the RabbitMQ application:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl start_app'
      
    5. Verify that the RabbitMQ cluster is healthy:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
      
    6. Remove the iptables rules to allow clients to connect back to RabbitMQ:

      salt -C 'I@rabbitmq:server' cmd.run 'iptables -D INPUT -p tcp --dport 5671:5672 -j REJECT'
      

If none of the approaches fixes the issue, proceed with Restart RabbitMQ with clearing the Mnesia database.

Troubleshoot MongoDB

This section instructs you on how to troubleshoot the MongoDB failures.

To troubleshoot MongoDB:

  1. Log in to the MongoDB shell as Administrator into the admin collection:

    mongo -uadmin -p$(salt-call pillar.get mongodb:server:admin:password|tail -1|tr -d ' ') admin
    
  2. Verify the replica set configuration and status:

    rs.conf()
    rs.status()
    
  3. To force synchronization of one of the replicas, delete data in the replica directory and start MongoDB:

    Caution

    Proceed with caution to avoid any data loss.

    service mongodb stop
    mv /var/lib/mongodb /var/lib/mongodb.backup
    mkdir /var/lib/mongodb
    chown mongodb:mongodb /var/lib/mongodb
    service mongodb start
    

Troubleshoot a Galera cluster

A Galera cluster is a true synchronous multi-master replication system. Any or all nodes composing the cluster can be used as a master at any time, meaning that there is no failover in the traditional MySQL master-slave sense. However, you may encounter issues with your Galera cluster after restarting the cluster during maintenance or an upgrade.

This section instructs you on how to troubleshoot the Galera cluster as well as how to restart, restore, and rejoin nodes to the cluster.

Verify a Galera cluster status

You can verify the status of a MySQL Galera cluster either manually from the Salt Master node or automatically using Jenkins.

To verify a MySQL Galera cluster status, select from the following options:

  • Verify the Galera cluster status automatically using the Verify and Restore Galera cluster Jenkins pipeline as described in Restore a Galera cluster and database automatically.

  • Verify the Galera cluster status manually:

    1. Log in to the Salt Master node.

    2. Apply the following state:

      salt -C 'I@galera:master' mysql.status | \
      grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
      

      Example of system response:

      wsrep_cluster_size:
          3
      
      wsrep_incoming_addresses:
          192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
      
      wsrep_local_state_comment:
          Synced
      
Restore a Galera cluster

This section instructs you on how to restore the Galera cluster either using the dedicated Jenkins pipeline or manually. The automatic restoration procedure included in this section assumes that only one Galera node is down or has corrupted data. If you experience a loss of multiple nodes, apply the manual procedure adjusted to the needs of your deployment.

Prepare for a Galera cluster restoration

The restoration of the Galera cluster fully depends on the correct Xtrabackup configuration. Service misconfiguration or skipping the data backup stage may lead to a restoration failure.

To prepare for the Galera cluster restoration:

  1. Log in to the Salt Master node.

  2. Configure Xtrabackup by setting the parameters as required in cluster/openstack/database/init.yml. For example:

    Note

    You will need to revert the Xtrabackup configuration after the Galera cluster is restored.

    parameters:
      xtrabackup:
        client:
          enabled: true
          restore_full_latest: 1
          restore_from: remote
    

    The available values for the configurable parameters include:

    • restore_full_latest can have the following values:

      1

      Restore the database from the last complete backup and its increments.

      2

      Restore the database from the second latest complete backup and its increments.

    • restore_from can have the following values:

      local

      Restore from the local storage

      remote

      Use scp to get the files from the xtrabackup server.

  3. Create a data backup using the Xtrabackup service as described in Create an instant backup of a MySQL database. The backup data will be used after the restart is completed.

  4. Proceed with one of the following options:

Restore a Galera cluster and database automatically

The Galera cluster ensures that the OpenStack services are operable. In case of a cluster outage, the number of manual steps required to start the cluster and ensure the necessary access can significantly delay the restoration of services and is prone to operator errors. Therefore, to reduce the complexity of the procedure and support greater scalability, MCP provides the automatic way to verify and restore the Galera cluster in your deployment.

This section describes how to verify the status of a Galera cluster and restore it using the Verify and Restore Galera cluster Jenkins pipeline. Use the automatic restoration procedure only if one Galera node is down or its data is corrupted. Otherwise, apply the manual procedure adjusted to the needs of your deployment as described in Restore a Galera cluster manually.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Note

The Verify and Restore Galera cluster Jenkins pipeline restores the Galera cluster with the provided configuration and does not fix the issues caused by cluster misconfiguration.

To restore the Galera cluster and database automatically:

  1. Log in to the Jenkins web UI.

  2. Open the Verify and Restore Galera cluster pipeline.

  3. Specify the required parameters:

    Parameter Description and values
    SALT_MASTER_URL Add the IP address of your Salt Master node host and the salt-api port. For example, http://172.18.170.27:6969.
    CREDENTIALS_ID Add credentials_id as credentials for the connection.
    RESTORE_TYPE Check ONLY_RESTORE if a manual backup has already been performed; the created backup will be used during the restoration. Check BACKUP_AND_RESTORE if a backup has not been performed and must be created during the pipeline run.
    ASK_CONFIRMATION Set to False if you do not want the pipeline to wait for a manual confirmation before running the restoration. Defaults to True.
    CHECK_TIME_SYNC Set to False if you do not want the pipeline to verify the time synchronization across the nodes. Defaults to True.
    VERIFICATION_RETRIES Specify the number of retries of the verification process after the restoration was performed. Increase the value for bigger clusters as it may take more time for such clusters to come up and synchronize. Defaults to 5.
  4. Click Deploy.

    The pipeline workflow:

    1. The verification stage:

      1. Obtaining and parsing the result of the mysql.status call.
      2. Formatting and printing a result report to the user.

      Example of a verification report:

       CLUSTER STATUS REPORT: 6 expected values, 0 warnings and 1 error found:
      
       [OK     ] Cluster status: Primary (Expected: Primary)
       [OK     ] Master node status: true (Expected: ON or true)
       [OK     ] Master node status comment: Synced (Expected: Joining or Waiting on SST
                 or Joined or Synced or Donor)
       [OK     ] Master node connectivity: true (Expected: ON or true)
       [OK     ] Average size of local reveived queue: 0.166667 (Expected: below 0.5)
                 (Value above 0 means that the node cannot apply write-sets as fast
                 as it receives them, which can lead to replication throttling)
       [OK     ] Average size of local send queue: 0.010204 (Expected: below 0.5)
                 (Value above 0 indicate replication throttling or network throughput
                 issues, such as a bottleneck on the network link.)
      
       [  ERROR] Current cluster size: 2 (Expected: 3)
      
       Errors found.
      
      There's something wrong with the cluster, do you want to run a restore?
      
       Are you sure you want to run a restore? Click to confirm
       Proceed or Abort
      
    2. Optional. The backup stage:

      Running the Galera database backup pipeline. For the pipeline workflow, see Create an instant backup of a MySQL database automatically.

    3. The restoration stage:

      1. If Proceed is selected, the restoration stage will continue. Otherwise, it will abort.
      2. The last shutdown node will be used as a source of truth.
    4. The verification stage:

      Verifying the status of the cluster.

  5. After the restoration is finalized, verify that all nodes are back and the cluster is working.

  6. Revert the changes made in the cluster/openstack/database/init.yml file in step 2 of Prepare for a Galera cluster restoration.

Restore a Galera cluster manually

As the manual procedure of the Galera cluster restoration is prone to operator errors, we recommend restoring the cluster using the Jenkins pipeline as described in Restore a Galera cluster and database automatically. Though, if you cannot apply the automatic procedure to your deployment for some reason, use the manual instruction included in this section.

Note

To restore a Galera database using the Jenkins pipeline, see Restore a Galera cluster and database automatically.

To restore a Galera cluster manually:

  1. On all Galera dbs nodes:

    1. Stop all the MySQL processes.

    2. Verify that the MySQL processes are stopped:

      ps aux | grep mysql
      
    3. Identify the last shutdown Galera node:

      1. In the /var/lib/mysql/grastate.dat file on every Galera node, compare the seqno value. The Galera node that contains the maximum seqno value is the last shutdown node.
      2. If the seqno value is equal on all three nodes, identify the node on which the /var/lib/mysql/gvwstate.dat file exists. The Galera node that contains this file is the last shutdown node.
    4. Remove the grastate.dat file and the ib_logfile files:

      rm /var/lib/mysql/grastate.dat
      rm /var/lib/mysql/ib_logfile*
      
  2. Log in to the last shutdown Galera node.

  3. In /etc/mysql/my.cnf:

    1. Comment out the wsrep_cluster_address line:

      ...
      #wsrep_cluster_address="gcomm://192.168.0.1,192.168.0.2,192.168.0.3"
      ...
      
    2. Add the wsrep_cluster_address parameter without any IP address specified.

      ...
      wsrep_cluster_address=gcomm://
      ...
      
  4. Start MySQL:

    service mysql start
    
  5. Validate the current status of the Galera cluster:

    salt-call mysql.status | grep -A1 wsrep_cluster_size
    
  6. Start MySQL on the second Galera node. Wait until the node re-joins the cluster.

  7. When the cluster size equals two, start MySQL on the third node.

  8. Verify the MySQL status. The cluster size should be equal to three:

    salt-call mysql.status | grep -A1 wsrep_cluster_size
    
  9. Log in to the last shutdown Galera node.

  10. In /etc/mysql/my.cnf:

    1. Uncomment the line with the wsrep_cluster_address parameter:

      ...
      wsrep_cluster_address="gcomm://192.168.0.1,192.168.0.2,192.168.0.3"
      ...
      
    2. Remove the wsrep_cluster_address parameter line that has no IP addresses specified.

  11. Verify that the /etc/salt/.galera_bootstrap file exists on every dbs node. Otherwise, create one:

    touch /etc/salt/.galera_bootstrap
    -rw-r--r-- 1 root root 0 Feb 12 08:31 /etc/salt/.galera_bootstrap
    
  12. Revert the changes made in the cluster/openstack/database/init.yml file in step 2 of Prepare for a Galera cluster restoration.

Restart a Galera cluster

This section instructs you on how to restart the whole Galera cluster. You may need to restart the whole cluster if all three Galera nodes fail to start.

Before you restart the Galera cluster, create a data backup using the Xtrabackup service as described in Create an instant backup of a MySQL database.

Caution

We recommend that you always back up your MySQL database before performing any changes to the Galera cluster, even if you do not plan to restore the backup data afterwards.

To restart the Galera cluster, use Restore a Galera cluster manually.

Rejoin a MySQL node

The MySQL Galera cluster contains three nodes. If one node fails, the MySQL database does not have an outage. Solve the problem by restarting the mysql service on the failed node. If the node cannot be rejoined to the cluster after restart, remove the following files from the /var/lib/mysql/ directory and restart the mysql service again:

rm -rf /var/lib/mysql/grastate*
rm -rf /var/lib/mysql/ib_log*
service mysql start

Troubleshoot GlusterFS

This section describes how to verify the GlusterFS services status and troubleshoot the GlusterFS-related issues if any.

Caution

If you need to reboot the kvm nodes that host the GlusterFS cluster, do not proceed with rebooting the next kvm node until you make sure that the replication status of the GlusterFS services and volumes is healthy after rebooting the previous kvm node.

Verify a GlusterFS cluster status

If you update or upgrade your MCP cluster, the kvm nodes that usually host the GlusterFS cluster require a reboot for the updates to be applied. In such cases, you have to verify the GlusterFS cluster status after reboot of every node before rebooting the next kvm node.

To verify a GlusterFS cluster status:

  1. Log in to the Salt Master node.

  2. Verify that the GlusterFS server status is healthy:

    salt -C 'I@glusterfs:server' cmd.run "gluster peer status"
    

    Example of system response:

    kvm01.cookied-cicd-bm-os-contrail40-maas.local:
    Number of Peers: 2
    
    Hostname: 10.167.8.243
    Uuid: 5ac8ee70-40a3-44c4-8ec8-967a0584a0d4
    State: Peer in Cluster (Connected)
    
    Hostname: 10.167.8.242
    Uuid: 8ce5fd88-2cae-47b5-a397-447874686406
    State: Peer in Cluster (Connected)
    
    kvm02.cookied-cicd-bm-os-contrail40-maas.local:
    Number of Peers: 2
    
    Hostname: 10.167.8.243
    Uuid: 5ac8ee70-40a3-44c4-8ec8-967a0584a0d4
    State: Peer in Cluster (Connected)
    
    Hostname: kvm01.cookied-cicd-bm-os-contrail40-maas.local
    Uuid: 6bf7eccb-69a2-42bd-b033-50d683120130
    State: Peer in Cluster (Connected)
    
    kvm03.cookied-cicd-bm-os-contrail40-maas.local:
    Number of Peers: 2
    
    Hostname: 10.167.8.242
    Uuid: 8ce5fd88-2cae-47b5-a397-447874686406
    State: Peer in Cluster (Connected)
    
    Hostname: kvm01.cookied-cicd-bm-os-contrail40-maas.local
    Uuid: 6bf7eccb-69a2-42bd-b033-50d683120130
    State: Peer in Cluster (Connected)
    
  3. Verify that the GlusterFS client status is healthy:

    salt -C 'I@glusterfs:client and not I@glusterfs:server' test.ping
    

    Example of system response:

    cid03.cookied-cicd-bm-os-contrail40-maas.local:
        True
    cid02.cookied-cicd-bm-os-contrail40-maas.local:
        True
    cid01.cookied-cicd-bm-os-contrail40-maas.local:
        True
    ctl03.cookied-cicd-bm-os-contrail40-maas.local:
        True
    prx02.cookied-cicd-bm-os-contrail40-maas.local:
        True
    prx01.cookied-cicd-bm-os-contrail40-maas.local:
        True
    ctl01.cookied-cicd-bm-os-contrail40-maas.local:
        True
    ctl02.cookied-cicd-bm-os-contrail40-maas.local:
        True
    
  4. If any of the above commands fail, refer to the GlusterFS official documentation to troubleshoot the required services.

  5. Verify the GlusterFS volumes status as described in Verify the GlusterFS volumes status.

Verify the GlusterFS volumes status

This section describes how to verify the status of the GlusterFS volumes and troubleshoot issues if any.

To verify the GlusterFS volumes status:

  1. Log in to the Salt Master node.

  2. Verify the GlusterFS volumes status:

    salt -C 'I@glusterfs:server' cmd.run "gluster volume status all"
    
  3. If the system output contains issues, such as in the example below, and/or the volume status cannot be retrieved, refer to the GlusterFS official documentation to resolve the issues.

    Example of system response:

    kvm01.cookied-cicd-bm-os-contrail40-maas.local:
    Another transaction is in progress for aptly. Please try again after sometime.
    Another transaction is in progress for gerrit. Please try again after sometime.
    Another transaction is in progress for keystone-credential-keys. Please try again after sometime.
    Another transaction is in progress for mysql. Please try again after sometime.
    Another transaction is in progress for registry. Please try again after sometime.
    Locking failed on 10.167.8.243. Please check log file for details.
    
  4. Inspect the GlusterFS server logs for volumes at /var/log/glusterfs/bricks/srv-glusterfs-<volume name>.log on the kvm nodes.

  5. In case of any issues with the replication status of GlusterFS, stop all the GlusterFS volume-related services to prevent data corruption and immediately proceed with troubleshooting to restore the volume in question.

  6. If you need to reboot a kvm node that hosts GlusterFS, verify that all GlusterFS clients have their volumes mounted after the node reboot:

    1. Log in to the Salt Master node.

    2. Identify the GlusterFS VIP address:

      salt-call pillar.get _param:infra_kvm_address
      

      Example of system response:

      local:
          10.167.8.240
      
    3. Verify that all GlusterFS clients have volumes mounted. For example:

      salt -C 'I@glusterfs:client and not I@glusterfs:server' cmd.run "mount | grep 10.167.8.240"
      

      Example of system response:

      prx02.cookied-cicd-bm-os-contrail40-maas.local:
          10.167.8.240:/salt_pki on /srv/salt/pki type fuse.glusterfs
      (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
      ctl01.cookied-cicd-bm-os-contrail40-maas.local:
          10.167.8.240:/keystone-credential-keys on /var/lib/keystone/credential-keys type fuse.glusterfs
      (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
          10.167.8.240:/keystone-keys on /var/lib/keystone/fernet-keys type fuse.glusterfs
      (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
      cid03.cookied-cicd-bm-os-contrail40-maas.local:
      prx01.cookied-cicd-bm-os-contrail40-maas.local:
          10.167.8.240:/salt_pki on /srv/salt/pki type fuse.glusterfs
      (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
      cid01.cookied-cicd-bm-os-contrail40-maas.local:
      cid02.cookied-cicd-bm-os-contrail40-maas.local:
      ctl03.cookied-cicd-bm-os-contrail40-maas.local:
      cfg01.cookied-cicd-bm-os-contrail40-maas.local:
          10.167.8.240:/salt_pki on /srv/salt/pki type fuse.glusterfs
          (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
      ctl02.cookied-cicd-bm-os-contrail40-maas.local:
      

      In the example above, several VMs do not have the GlusterFS volumes mounted, for example, cid01, cid02, and cid03. In such a case, reboot the corresponding VMs and verify the status again.

    4. Verify that all mounted volumes identified in the previous step match the pillar information on the corresponding VM. For example:

      salt prx02.cookied-cicd-bm-os-contrail40-maas.local pillar.get glusterfs:client:volumes
      

      Example of system response:

      ----------
          salt_pki:
          ----------
          opts:
              defaults,backup-volfile-servers=10.167.8.241:10.167.8.242:10.167.8.243
          path:
              /srv/salt/pki
          server:
              10.167.8.240
      

      In the output above, the server IP address and the only mounted salt_pki volume information match the output for the prx02 VM shown in the previous step.

    Caution

    Do not proceed to reboot the next kvm node that hosts the GlusterFS cluster before all volumes are mounted on VMs of the first kvm node.

Troubleshoot an instance failure

Depending on your needs, debug an instance failure by examining the following log files:

  • /var/log/nova/nova-scheduler.log
  • /var/log/nova/nova-compute.log

To delete an instance in the error state:

nova reset-state <instance_id>
nova reset-state --active <instance_id>
nova delete <instance_id>
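
To list the instances that are currently in the error state before resetting them, you can, for example, run:

nova list --all-tenants --status ERROR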

Increase the size limit for uploading a Glance image

The maximum size of a Glance image is limited to 30 GB on the system level of the Reclass model in /nginx/server/proxy/openstack/glance.yml. Due to this limitation, you may receive the 500 Internal Server Error from NGINX once the upload size of a Glance image reaches 30 GB. Also, in such cases, the prx node may fail if the disk size is less than 30 GB.

To increase the upload size limit of a Glance image:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In openstack/proxy.yml, add the following parameters under nginx:server:site:

    ...
    nginx_proxy_openstack_api_glance:
      proxy:
        request_buffer: false
        size: 100000m
    
  3. Log in to the Salt Master node.

  4. Apply the following state:

    salt -C 'I@nginx:server' state.sls nginx.server
    

Clearing ephemeral LVM volumes using shred consumes huge hardware resources

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To prevent excessive disk I/O consumption while clearing ephemeral LVM volumes using shred, you can set the ionice level for the ephemeral LVM volume shred operation in nova-compute.

Setting the ionice level as described below makes sense only if the following conditions are met (see the verification example after this list):

  • nova:compute:lvm:ephemeral is set to True
  • nova:compute:lvm:volume_clear is set to zero or shred
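
Before changing anything, you can verify the current values of these pillar parameters, for example:

salt -C 'I@nova:compute' pillar.get nova:compute:lvm:ephemeral
salt -C 'I@nova:compute' pillar.get nova:compute:lvm:volume_clear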

To set the ionice level:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/openstack/compute.yml, set the level for volume_clear_ionice_level as required:

    nova:
      compute:
        lvm:
          volume_clear_ionice_level: <level>
    

    Possible <level> values are as follows:

    • idle - to use the idle scheduling class. This option impacts system performance the least with a downside of increased time for a volume clearance.
    • From 0 to 7 - to use the best-effort scheduling class. Set the priority level to the specified number.
    • No value - not to set the I/O scheduling class explicitly. Mirantis does not recommend using no value since this is the most aggressive option in terms of system performance impact.
  3. Apply the changes:

    salt -C 'I@nova:compute' state.sls nova.compute
    

Insufficient OVS timeouts causing instance traffic losses

If you receive the OVS timeout errors in the neutron-openvswitch-agent logs, such as ofctl request <...> timed out: Timeout: 10 seconds or Commands [<ovsdbap...>] exceeded timeout 10 seconds, you can configure the OVS timeout parameters as required depending on the number of the OVS ports on the gtw nodes in your cloud. For example, if you have more than 1000 ports per gtw node, Mirantis recommends changing the OVS timeouts as described below. The same procedure can be applied to the compute nodes if required.
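
To estimate the number of OVS ports on a particular gtw or cmp node, you can, for example, count the ports attached to the integration bridge directly on that node:

ovs-vsctl list-ports br-int | wc -l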

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To increase OVS timeouts on the gateway nodes:

  1. Log in to the Salt Master node.

  2. Open /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/gateway.yml for editing.

  3. Add the following snippet to the parameters section of the file with the required values.

    neutron:
      gateway:
        of_connect_timeout: 60
        of_request_timeout: 30
        ovs_vsctl_timeout: 30  # Pike
        ovsdb_timeout: 30  # Queens and beyond
    
  4. Apply the following state:

    salt -C 'I@neutron:gateway' state.sls neutron
    
  5. Verify whether the Open vSwitch logs contain the Datapath Invalid and no response to inactivity probe errors:

    • In the neutron-openvswitch-agent logs, for example:

      ERROR ... ofctl request <...> error Datapath Invalid 64183592930369: \
      InvalidDatapath: Datapath Invalid 64183592930369
      
    • In openvswitch/ovs-vswitchd.log, for example:

      ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe \
      after 5 seconds, disconnecting
      

    If the logs contain such errors, increase inactivity probes for the OVS bridge controllers and OVS manager:

    1. Log in to any gtw node.

    2. Run the following commands:

      ovs-vsctl set controller br-int inactivity_probe=60000
      ovs-vsctl set controller br-tun inactivity_probe=60000
      ovs-vsctl set controller br-floating inactivity_probe=60000
      
    3. Identify the OVS manager ID:

      ovs-vsctl list manager
      
    4. Run the following command:

      ovs-vsctl set manager <ovs_manager_id> inactivity_probe=30000
      

To increase OVS timeouts on the compute nodes:

  1. Log in to the Salt Master node.

  2. Open /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/compute.yml for editing.

  3. Add the following snippet to the parameters section of the file with the required values.

    neutron:
      compute:
        of_connect_timeout: 60
        of_request_timeout: 30
        ovs_vsctl_timeout: 30  # Pike
        ovsdb_timeout: 30  # Queens and beyond
    
  4. Apply the following state:

    salt -C 'I@neutron:compute' state.sls neutron
    
  5. Verify whether the Open vSwitch logs contain the Datapath Invalid and no response to inactivity probe errors:

    • In the neutron-openvswitch-agent logs, for example:

      ERROR ... ofctl request <...> error Datapath Invalid 64183592930369: \
      InvalidDatapath: Datapath Invalid 64183592930369
      
    • In openvswitch/ovs-vswitchd.log, for example:

      ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe \
      after 5 seconds, disconnecting
      

    If the logs contain such errors, increase inactivity probes for the OVS bridge controllers and OVS manager:

    1. Log in to the target cmp node.

    2. Run the following commands:

      ovs-vsctl set controller br-int inactivity_probe=60000
      ovs-vsctl set controller br-tun inactivity_probe=60000
      ovs-vsctl set controller br-floating inactivity_probe=60000
      
    3. Identify the OVS manager ID:

      ovs-vsctl list manager
      
    4. Run the following command:

      ovs-vsctl set manager <ovs_manager_id> inactivity_probe=30000
      

Avoiding ARP flux on the multi-interface Linux hosts

A Linux host with two or more Ethernet interfaces on the same subnet may cause the Address Resolution Protocol (ARP) flux issue when a machine responds to the ARP requests from both Ethernet interfaces of the same subnet.

To avoid the ARP flux issue:

  1. Log in to the affected node.

  2. Set the following network parameters:

    sysctl -w net.ipv4.conf.all.arp_ignore=2
    

    This configuration enables responding to the ARP requests only if the target IP is a local address that is configured on the incoming interface and, together with the sender IP address, is part of the same subnet on this interface.
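
    To make the setting persistent across reboots, you can, for example, store it in a dedicated sysctl configuration file. The file name below is an arbitrary example:

    echo "net.ipv4.conf.all.arp_ignore = 2" > /etc/sysctl.d/90-arp-flux.conf
    sysctl -p /etc/sysctl.d/90-arp-flux.conf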

Running the controller node after outage causes errors

If the OpenStack controller node was affected by a host or network outage and was unable to communicate with other nodes of the cluster, you may occasionally receive 401 Unauthorized errors from Keystone. To resolve the issue, synchronize the Keystone fernet tokens and credentials with other OpenStack controller nodes of the cluster.

To synchronize the Keystone fernet tokens and credentials:

  1. Log in to the affected OpenStack controller node.

  2. Synchronize the Keystone fernet tokens:

    su -c "/var/lib/keystone/keystone_keys_rotate.sh -s -t fernet" keystone
    
  3. If the OpenStack controller node was unavailable at 12:00 a.m., also synchronize the Keystone credentials:

    su -c "/var/lib/keystone/keystone_keys_rotate.sh -s -t credential" keystone
    

Manually remove a broken Swift container

All Swift container operations such as updating container metadata or removing a container fail if the name of the Swift container includes a forward slash (/) character. Typically, Swift does not accept container names that begin with the forward slash character. However, if you have such containers, you must remove them manually as described below. For example, you may get the following errors when trying to remove a bucket:

radosgw-admin bucket rm --bucket='{bucket_name}'
2020-04-23 21:17:37.581515 7f8880047e40 -1 ERROR: get_bucket_instance_from_oid failed: -2
2020-04-23 21:17:37.581535 7f8880047e40  0 could not get bucket info for bucket={bucket_name}

Note

RADOS Gateway uses the term bucket to describe a data container since it may be construed as the equivalent of the term container used within Swift.

Warning

The following procedure includes changes in an underlying RADOS Gateway object and can cause a major malfunction of Swift if performed incorrectly.

To remove a broken Swift container manually:

  1. Optional. Identify the ID of the user who created the affected bucket. Run radosgw-admin user check against each user to find the one causing an inconsistency error:

    radosgw-admin user list | grep \" | cut -d\" -f2 | while read uid; do echo "--- $uid"; radosgw-admin user check --uid $uid; done
    

    Example of system response:

    --- {user_id1}
    --- {user_id2}
    --- {user_id3}
    --- {user_id4}
    --- {user_id5}
    --- {user_id6}
    --- {user_id7}
    2020-04-23 21:17:37.581515 7f8880047e40 -1 ERROR: get_bucket_instance_from_oid failed: -2
    2020-04-23 21:17:37.581535 7f8880047e40  0 could not get bucket info for bucket={bucket_name}
    --- {user_id8}
    --- {user_id9}
    
  2. Remove the link between the user and the bucket:

    rados -p default.rgw.meta -N users.uid rmomapkey {user_id}.buckets {bucket_name}
    
  3. List all metadata objects for the broken bucket. The output contains a unique ID of the RADOS Gateway object.

    rados -p default.rgw.meta ls --all | grep {bucket_name}
    

    Example of system response:

    root 8ea1a1093ba949deb63010c23caaff5a/{bucket_name}
    root .bucket.meta.8ea1a1093ba949deb63010c23caaff5a:{bucket_name}:4d1cb33f-c11c-42ea-8f84-08f1ac56907a.96818.3
    
  4. Remove the object using the bucket ID from the previous step. For example:

    rados -p default.rgw.meta -N root rm 8ea1a1093ba949deb63010c23caaff5a/{bucket_name}
    rados -p default.rgw.meta -N root rm .bucket.meta.8ea1a1093ba949deb63010c23caaff5a:{bucket_name}:4d1cb33f-c11c-42ea-8f84-08f1ac56907a.96818.3
    

Troubleshoot OpenContrail

This section includes workarounds for the OpenContrail-related issues in a running MCP cluster.

Troubleshoot the OpenStack-specific issues

This section includes the OpenStack-specific troubleshooting procedures for the OpenContrail services.

Troubleshoot Cassandra for OpenContrail 4.x

In case of issues with Cassandra not starting in OpenContrail 4.x, the system response of the doctrail analyticsdb contrail-status command on the nal nodes can be as follows:

== Contrail Database ==
contrail-database:            active
kafka:                        active
contrail-database-nodemgr:    initializing (Cassandra state detected DOWN.
                              Disk space for analytics db not retrievable.)

Workaround:

  1. Log in to the nal node in question.

  2. Remove all files from the /var/lib/cassandra/ folder:

    rm -rf /var/lib/cassandra/*
    

  3. Restart the service:

    doctrail analyticsdb service contrail-database restart
    
  4. Verify that the Cassandra status is active:

    doctrail analyticsdb contrail-status
    
  5. Verify that the replication status of Cassandra is UN (Up, Normal):

    doctrail analyticsdb nodetool status
    

    Example of system response:

    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address        Load       Tokens  Owns  Host ID                               Rack
    UN  172.17.98.161  255.56 MB  256     ?     eafd7b42-b977-44fa-8c5d-6bed193ac87f  rack1
    UN  172.17.98.163  389.98 MB  256     ?     89493967-bfdb-4300-bb39-08fb6f6d63f1  rack1
    UN  172.17.98.162  292.7 MB   256     ?     91444200-c370-4a6e-b7f8-cab0e78f39eb  rack1
    
Troubleshoot Cassandra for OpenContrail 3.2

In case of issues with the OpenContrail 3.2 Cassandra database connection, the system response of the contrail-status command can be as follows:

== Contrail Analytics ==
supervisor-analytics:         active
contrail-alarm-gen            active
contrail-analytics-api        active
contrail-analytics-nodemgr    active
contrail-collector            initializing (Database:ctl02:Global connection down)
contrail-query-engine         timeout
contrail-snmp-collector       active
contrail-topology             active

Workaround:

  1. Restart the supervisor-analytics service:

    service supervisor-analytics restart
    
  2. Verify the status of the Neutron API:

    neutron net-list
    +------------------------+-------------------------+---------+
    | id                     | name                    | subnets |
    +------------------------+-------------------------+---------+
    | d6638b91-4e1d-4214-... | ip-fabric               |         |
    | ce66ee12-71b4-44ea-... | __link_local__          |         |
    | d452af3a-3b9f-442e-... | default-virtual-network |         |
    +------------------------+-------------------------+---------+
    
  3. Restart nova-api to reflect installed OpenContrail dependencies:

    salt 'ctl*' service.restart nova-api
    ctl02.workshop.cloudlab.cz:
        True
    ctl03.workshop.cloudlab.cz:
        True
    ctl01.workshop.cloudlab.cz:
        True
    
  4. Verify the system response of the nova list command:

    nova list
    

Now, the OpenStack and OpenContrail controller nodes are properly deployed.

TCP checksum errors on compute nodes

In the following cases, TCP-based services may not work on VMs, or you may see TCP checksum errors increasing in the output of the dropstats command:

  • If the deployment has nested VMs.
  • If VMs are running in the ESXi hypervisor.
  • If the Network Interface Cards (NICs) do not support the IP checksum calculation and generate an incorrect checksum. For example, the Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe NIC cards.

To identify the issue:

  1. Inspect the output of the dropstats command that shows the number of Checksum errors.

  2. Inspect the output of the tcpdump command for a specific NIC, for example, enp2s0f1. If you find cksum incorrect entries, the issue exists in your environment.

    tcpdump -v -nn -l -i enp2s0f1.1162 host 10.0.2.162 | grep -i incorrect
    

    Example of system response:

    tcpdump: listening on enp2s0f1.1162, link-type EN10MB (Ethernet), capture size 262144 bytes
    10.254.19.231.80 > 192.168.100.3.45506: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901889431, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456361578 ecr 41455995,nop,wscale 7], length 0
    10.254.19.231.80 > 192.168.100.3.45506: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901889183, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456361826 ecr 41455995,nop,wscale 7], length 0
    10.254.19.231.80 > 192.168.100.3.45506: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901888933, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456362076 ecr 41455995,nop,wscale 7], length 0
    
  3. If you do not find the checksum errors using the tcpdump command, inspect the output of the flow -l command that shows information about drops for an unknown reason.

Workaround:

Turn off the transmit (TX) offloading on all compute nodes for the problematic NIC used by vRouter:

  1. Run the following command:

    ethtool -K <interface_name> tx off
    
  2. Verify the status of the TX checksumming:

    ethtool -k <interface_name>
    

    Example of system response:

    tx-checksumming: off
    tx-checksum-ipv4: off
    tx-checksum-ipv6: off
    tx-checksum-sctp: off
    tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp6-segmentation: off [requested on]
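
The ethtool setting does not persist across reboots. To reapply it automatically after a reboot, you can, for example, add a post-up hook to the definition of the problematic NIC, assuming the classic ifupdown configuration in /etc/network/interfaces:

post-up ethtool -K <interface_name> tx off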
    
Troubleshoot HAProxy and LBaaS configuration

Issues with the creation of network namespaces may occur due to an incorrect configuration of HAProxy and LBaaS or due to contrail-svc-monitor not working properly.

Workaround:

  1. Restart the contrail-svc-monitor service by restarting the supervisor-config service group:

    service supervisor-config restart
    
  2. Verify the OpenContrail status:

    contrail-status
    

    Example of the system response extract:

     == Contrail Config ==
     supervisor-config:            active
     contrail-api:0                active
     contrail-config-nodemgr       active
     contrail-device-manager       active
     contrail-discovery:0          active
     contrail-schema               active
     contrail-svc-monitor          active
     ifmap                         active
    
  3. Verify the HAProxy configuration created by LBaaS in /var/lib/contrail/loadbalancer/haproxy/${LB_UUID}/.

  4. Verify that HAProxy version >= 1.5 and iproute2 version >= 3.10.0 are installed on all compute nodes.
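
    To check the installed versions, you can, for example, run the following command from the Salt Master node. The haproxy -v and ip -V commands print the package versions:

    salt -C 'I@nova:compute' cmd.run "haproxy -v | head -1; ip -V"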

Neutron network ports for LBaaS are not deleted through Heat

If the Neutron network ports for LBaaS are not deleted through Heat, the cause may be an incorrect Reclass configuration of the OpenStack ctl node that runs the Neutron server.

Workaround:

  1. In your project repository, change the directory to /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/.

  2. In the classes section of the control.yml file, verify that the following parameter is defined for the ctl node running the Neutron server:

    - system.opencontrail.client.cluster
    
  3. For the Contrail-specific Heat templates, define the following parameter in the classes section of the /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/control.yml file:

    - system.heat.server.resource.contrail
    
Troubleshoot a VM network outage

This section explains how to troubleshoot a VM network outage.

To troubleshoot a VM network outage:

  1. Verify the disk space, CPU load, and RAM on the VM in question.

  2. Verify that the VM is enabled and has all interfaces up. You can do this using the Horizon Dashboard in the Admin > Instances tab or using CLI:

    # Get VM status
    nova list --all-tenants | grep <vmname>
    
    # Get hypervisor name
    nova show <vmname>
    
  3. Ping the default gateway (you can identify it using the ip r and ip a commands) and other VMs on the same network to identify whether the problem is global, VM-related, or hypervisor-related.

    Each VM has a virtual gateway, usually at the first address of the subnet. If you can ping the virtual gateway, the network connection between the VM and the vRouter on the hypervisor is not broken, which helps you rule out a broken network configuration inside the VM.

    If you can ping the default gateway but not anything outside, or you cannot ping other VMs inside the virtual network, it can be a hypervisor-related issue. If this is the case, follow the steps below:

    1. Log in to Horizon.
    2. Identify the compute node that runs the VM hypervisor.
    3. Log in to that compute node.
    4. Verify the status of the supervisor-vrouter service using the contrail-status command.
    5. If the supervisor-vrouter status is inactive:
      1. Restart supervisor-vrouter.
      2. Inspect the /var/log/contrail/contrail-vrouter* logs.
  4. Verify that the VM unavailability is not caused by the firewall rules set in Security Groups or Network Policies in Horizon. Verify the security groups associated with the VM and the network policies attached to the virtual network.

  5. Verify the peering status in the OpenContrail web UI by navigating to Monitor > Control Nodes > <control node> > Peers. The status should be Established, in sync. If this is not the case, select from the following options:

    • Verify the availability of network devices. If the network devices are in the expected status (active by default), you can restart all OpenContrail controller nodes in sequence. However, verify that at least two OpenContrail controller nodes are up and in a correct state. You should never restart all nodes at once.
    • Restart the supervisor-control service and verify whether it is in the active status using the contrail-status command.
  6. Verify the OpenContrail status on the OpenContrail node that runs the hypervisor with the VM.

    If the contrail-status output contains some services not in the active status, restart the supervisor-vrouter process and verify the status again.

  7. To identify whether the problem is with the vRouter on the hypervisor or with a VM inside the network setup, ping another VM on the same hypervisor.

    You can also ping the link-local IP address. To get this address, select from the following options:

    • Using the OpenContrail web UI, find the VM details on the Contrail Dashboard -> vRouter with VM page.

    • Using CLI, run the following command on the OpenStack compute node in question:

      netstat -r
      

      Example of system response:

      Kernel IP routing table
      Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
      default         10.0.106.1      0.0.0.0         UG        0 0          0 vhost0
      localnet        *               255.255.255.0   U         0 0          0 vhost0
      169.254.0.3     *               255.255.255.255 UH        0 0          0 vhost0
      169.254.0.4     *               255.255.255.255 UH        0 0          0 vhost0
      169.254.0.5     *               255.255.255.255 UH        0 0          0 vhost0
      
    • Using SSH, log in to the hypervisor that hosts the VM and run the following command:

      vif --list
      

      Example of system response:

      Vrouter Interface Table
      
      Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
             Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
             D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
             Mnp=No MAC Proxy
      
      vif0/0      OS: bond0.3034 (Speed 20000, Duplex 1)
                  Type:Physical HWaddr:a4:1f:72:0a:93:8c IPaddr:0
                  Vrf:0 Flags:TcL3L2Vp MTU:1514 Ref:22
                  RX packets:9294622  bytes:1402159738 errors:0
                  TX packets:14035541  bytes:10121866276 errors:0
      
      vif0/1      OS: vhost0
                  Type:Host HWaddr:a4:1f:72:0a:93:8c IPaddr:0
                  Vrf:0 Flags:L3L2 MTU:1514 Ref:3
                  RX packets:9649123  bytes:9390647810 errors:0
                  TX packets:5671040  bytes:993417128 errors:0
      
  8. Inspect the OpenContrail log files in /var/log/contrail/* and the /var/log/contrail/contrail-vrouter*.log file to debug vRouter, in particular.

  9. If the problem is related to the networking inside the VM, verify the interface configuration, the DNS resolv.conf file, routing table, and so on.
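
    For example, the following commands run inside the VM cover these basic checks:

    ip addr
    ip route
    cat /etc/resolv.conf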

Troubleshoot SNAT and LBaaS namespaces

SNAT or LBaaS provisions two namespaces in the active-standby mode on two random OpenStack compute nodes. To display the namespaces, use the ip netns command.

The namespaces with vrouter-UUID represent SNAT and the namespaces with vrouter-UUID-UUID represent LBaaS. The namespaces can be managed as standard Linux network namespaces using the ip netns command.
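
For example, to list the namespaces on a compute node and to inspect the interfaces inside one of them:

ip netns list
ip netns exec <namespace_name> ip addr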

Slow response time from JMX Exporter

The third-party jmx-exporter service used by StackLight for exporting the OpenContrail Cassandra metrics may have a slow response time on the ntw node where the Cassandra backup is enabled. Usually, this is the ntw01 node.

You may also detect the following symptoms of the issue:

  • Grafana does not display metrics for Cassandra.
  • The PrometheusTargetDown alert for the jmx_cassandra_exporter job appears in the FIRING state for the ntw0x node.
  • The contrail-database-nodemgr service status is initializing.

Workaround:

  1. Log in to the ntw01 node.

  2. Verify that the Cassandra snapshots are automatically backed up to /var/backups/cassandra/. Otherwise, manually back up the snapshots located in /var/lib/cassandra/data.

  3. Clear the Cassandra snapshots. For example:

    • For OpenContrail 4.x:

      doctrail controller nodetool -h localhost -p 7198 clearsnapshot
      
    • For OpenContrail 3.2:

      nodetool -h localhost -p 7198 clearsnapshot
      
  4. If clearing the snapshots does not resolve the issue, increase the scrape_interval and scrape_timeout values for jmx_cassandra_exporter:

    1. Open your Git project repository with the Reclass model on the cluster level.

    2. In cluster/<cluster_name>/stacklight/server.yml, modify the scrape parameters. For example:

      prometheus:
        server:
          target:
            static:
              jmx_cassandra_exporter:
                scheme: http
                metrics_path: /metrics
                honor_labels: False
                scrape_interval: 60s
                scrape_timeout: 60s
      
    3. Log in to the Salt Master node.

    4. Apply the changes to the Reclass model:

      salt 'cfg01*' state.apply reclass.storage
      salt '*' saltutil.sync_all
      
    5. Apply the following state:

      salt -C 'I@prometheus:server' state.sls prometheus
      
    6. Connect to Grafana as described in Connect to Grafana.

    7. Navigate to the Cassandra dashboard.

    8. Verify that the rate_interval value is more than 1m.

FIP is not associated to an LB port through Terraform

If you use Terraform for creating resources in OpenStack, due to a limitation, Terraform cannot associate a floating IP (FIP) address to a load balancer (LB) port managed by the Avi Vantage tool. This limitation is related to the following resource in terraform-openstack-provider:

resource "openstack_networking_floatingip_associate_v2" "association" {
  floating_ip = "<Floating-ip address>"
  port_id = "<LoadBalancer port id>"
}

Workaround:

Associate a floating IP address to an LB port using Horizon or the Neutron CLI, for example:

neutron floatingip-associate <floating_ip_id> <lb_port_id>
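
If the neutron CLI is deprecated in your environment, the same association can be performed with the OpenStack client, for example:

openstack floating ip set --port <lb_port_id> <floating_ip>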

See also

Terraform issue

Troubleshoot OpenContrail 3.2

This section includes procedures to troubleshoot the OpenContrail 3.2 services.

Perform initial troubleshooting

This section describes basic troubleshooting steps for the OpenContrail-related services.

To perform initial troubleshooting:

  1. Verify the NTP peers on every node of your MCP cluster:

    ntpq -p
    

    Example of system response:

         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    +tik.cesnet.cz   195.113.144.238  2 u  728 1024  377    4.645   -0.199   0.545
    *netopyr.hanacke .GPS.            1 u 1604 1024  276   14.931   -0.021   0.373
    

    If at least one of the peers has * before its name, the time is synchronized. Otherwise, inspect the /etc/ntp.conf file.

    Example of an ntp.conf file

    # Associate to cloud NTP pool servers
    server ntp.cesnet.cz iburst
    server pool.ntp.org
    
    # Only allow read-only access from localhost
    restrict default noquery nopeer
    restrict 127.0.0.1
    restrict ::1
    
    # Location of drift file
    driftfile /var/lib/ntp/ntp.drift
    logfile /var/log/ntp.log
    
  2. Verify the disk space, Inode, RAM, and CPU usage on every OpenContrail node. The total amount of used resources in the output must not exceed 90%.

    • To verify the disk space:

      df -h
      

      Example of system response:

      Filesystem      Size  Used Avail Use% Mounted on
      udev            3.9G   12K  3.9G   1% /dev
      tmpfs           799M  380K  798M   1% /run
      /dev/vda1        48G  5.7G   41G  13% /
      none            4.0K     0  4.0K   0% /sys/fs/cgroup
      none            5.0M     0  5.0M   0% /run/lock
      none            3.9G   12K  3.9G   1% /run/shm
      none            100M     0  100M   0% /run/user
      
    • To verify the Inode usage:

      df -i
      

      Example of system response:

      Filesystem       Inodes   IUsed    IFree IUse% Mounted on
      udev            2032563     533  2032030    1% /dev
      tmpfs           2037690     781  2036909    1% /run
      /dev/sda1       6250496 1396006  4854490   23% /
      tmpfs           2037690     304  2037386    1% /dev/shm
      tmpfs           2037690       6  2037684    1% /run/lock
      tmpfs           2037690      18  2037672    1% /sys/fs/cgroup
      /dev/sda6      53821440  731583 53089857    2% /home
      cgmfs           2037690      14  2037676    1% /run/cgmanager/fs
      tmpfs           2037690      44  2037646    1% /run/user/1000
      
    • To verify RAM usage:

      free -h
      

      Example of system response:

                   total       used       free     shared    buffers     cached
      Mem:          7.8G       7.3G       501M       416K       239M       2.6G
      -/+ buffers/cache:       4.5G       3.3G
      Swap:           0B         0B         0B
      
    • To verify CPU usage:

      cat /proc/stat | grep cpu | awk \
      '{unit=100/($1+$2+$3+$4+$5+$6+$7+$8+$9+$10); print $1 "\tidle: "  $5*unit  "%"}'
      

      Example of system response:

      cpu     idle: 94.1113%
      cpu0    idle: 94.3852%
      cpu1    idle: 92.851%
      cpu2    idle: 94.0428%
      cpu3    idle: 94.1673%
      cpu4    idle: 94.2658%
      cpu5    idle: 94.3526%
      cpu6    idle: 94.4082%
      cpu7    idle: 94.4092%
      
  3. Verify MTU and the status of interfaces on all OpenContrail nodes:

    ip link
    

    Example of system response:

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
        link/ether ac:de:48:b0:2d:3e brd ff:ff:ff:ff:ff:ff
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
        link/ether ac:de:48:a8:7a:09 brd ff:ff:ff:ff:ff:ff
    
  4. Verify that the number of files currently opened by the Linux kernel does not exceed the limit:

    cat /proc/sys/fs/file-nr
    

    Example of system response:

    17736    0    1609849
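
    In the output, the first value is the number of allocated file handles, the second is the number of allocated but unused handles, and the third is the system-wide limit. For example, you can calculate the current usage percentage as follows:

    awk '{printf "allocated=%d max=%d usage=%.2f%%\n", $1, $3, $1/$3*100}' /proc/sys/fs/file-nr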
    
OpenContrail services are down

If any OpenContrail service fails, start with a basic troubleshooting as described below. If it does not help, proceed with troubleshooting a specific service as required.

To perform basic service troubleshooting:

  1. Log in to any OpenContrail controller node.
  2. Verify the status of the service in question using the service <service_name> status command.
  3. Inspect the logs of the service in question in /var/log/contrail/<service-name>.log.
  4. Restart the service using the service <service_name> restart command.
  5. Verify whether the service is not flapping. For example, see: Verify the OpenContrail svc-monitor.
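
To check the status of the OpenContrail services on all controller nodes at once, you can, for example, run contrail-status through Salt, assuming the standard opencontrail:control pillar targeting:

salt -C 'I@opencontrail:control' cmd.run 'contrail-status'
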
Verify contrail-vrouter-agent

While troubleshooting OpenContrail, you may encounter an issue where the contrail-vrouter-agent configuration fails to be created.

To verify the contrail-vrouter-agent configuration:

  1. Log in to any OpenStack compute node.

  2. Verify whether the global vRouter configuration is created:

    contrail-status
    

    Example of system response:

     == Contrail vRouter ==
     supervisor-vrouter:           active
     contrail-vrouter-agent        initializing (No Configuration for self)
     contrail-vrouter-nodemgr      active
    

    If the status is active, the configuration is created successfully. If the contrail-vrouter-agent status is initializing (No Configuration for self), proceed to the next step.

  3. Use either the OpenContrail web UI or SaltStack commands to create the global vRouter configuration:

    • Using the OpenContrail web UI:

      1. Log in to the OpenContrail web UI as admin.
      2. Go to Configure > Infrastructure > Global Config.
      3. On the right side above the table, click Edit. If the global configuration is missing, the window with default values opens.
      4. Click Save.
    • Using SaltStack:

      1. Log in to the Salt Master node.

      2. Depending on your deployment, apply one of the following states:

        • For Kubernetes:

          salt 'ctl01*' state.sls opencontrail.client
          
        • For OpenStack:

          salt 'ntw01*' state.sls opencontrail.client
          
      3. Verify if global-vrouter-config was created:

        • For Kubernetes:

          salt 'ctl01*' contrail.global_vrouter_config_list
          
        • For OpenStack:

          salt 'ntw01*' contrail.global_vrouter_config_list
          

        Note

        If the output is empty, add system.opencontrail.client.resource.global_vrouter_config as the last system class in /srv/salt/reclass/classes/cluster/<cluster_name>/opencontrail/control.yml and reapply the opencontrail.client state for Kubernetes or OpenStack as described in step 2.

      4. Verify the contrail-vrouter-agent status:

        salt 'cmp*' cmd.run "contrail-status"
        
      5. If global-vrouter-config is still missing, perform the following manual steps:

        1. Log in to any OpenContrail controller node, for example, ntw01.

        2. Run the following command:

          salt-call contrail.global_vrouter_config_create name="global-vrouter-config" \
          parent_type="global-system-config" encap_priority="MPLSoUDP,MPLSoGRE" \
          vxlan_vn_id_mode="automatic" \
          fq_names=['default-global-system-config','default-global-vrouter-config']
          
  4. Verify that the global vRouter configuration is created successfully and that the contrail-vrouter-agent service is in the active status using the contrail-status command.

Verify the default route in VRF

Usually, if an instance can ping the default gateway, for example, vRouter, but does not have access to the Internet or cannot ping an IP address outside the cloud, you should verify the default route in virtual routing and forwarding (VRF).

To verify the default route in VRF:

  1. Log in to the OpenContrail web UI.

  2. Go to Monitor > Infrastructure > Virtual Routers > cmp00x > Routes, where cmp00x is the name of the compute node in question.

  3. From the VRF drop-down list, select the VRF of the virtual network in question.

    The following example displays the default route 0.0.0.0/0 with two valid peer routes:

    _images/vrf_default_route.png
  4. Log in to any Mirantis OpenContrail analytics node, for example, nal01.

  5. Verify the status of the OpenContrail analytics services:

    contrail-status
    

    Example of system response:

    == Contrail Analytics ==
    supervisor-analytics:         active
    contrail-alarm-gen            active
    contrail-analytics-api        active
    contrail-analytics-nodemgr    active
    contrail-collector            active
    contrail-query-engine         active
    contrail-snmp-collector       active
    contrail-topology             active
    
    == Contrail Supervisor Database ==
    supervisor-database:          active
    contrail-database             active
    contrail-database-nodemgr     active
    kafka                         active
    
  6. If some service is not in the active status, fix the issue and restart the service using the service <service_name> restart command.

  7. Verify the contrail-svc-monitor status:

    1. Log in to the Mirantis OpenContrail controller ntw node.

    2. Verify that the contrail-svc-monitor state is active using the contrail-status command.

      Example of system response:

      == Contrail Config ==
      supervisor-config:            active
      contrail-api:0                active
      contrail-config-nodemgr       active
      contrail-device-manager       active
      contrail-discovery:0          active
      contrail-schema               initializing
      contrail-svc-monitor          active
      ifmap                         active
      
    3. If the contrail-svc-monitor state is inactive or initializing with an error message:

      1. Log in to the Mirantis OpenContrail controller node where contrail-svc-monitor is in the active state. For example, to ntw01.

      2. Restart the service using the following command. On the remaining ntw nodes, the contrail-svc-monitor state must be backup.

        service contrail-svc-monitor restart
        
      3. If the state is still inactive or initializing with an error message, recreate the gateway:

        1. Log in to the Horizon web UI.
        2. Go to Project > Network > Routers.
        3. On the right side of the router, click Clear Gateway.
        4. On the right side of the router, click Set Gateway.
        5. In the External Network field, specify the network to which the router will connect.
        6. Click Set Gateway.
  8. Verify the SNAT and other services instances as described in Verify SNAT and other services instances.

Verify the OpenContrail svc-monitor

The contrail-svc-monitor service listens to the configuration changes of the service templates and service instances as well as spawns and monitors virtual machines for the firewall, analyzer services, and so on. In the multi-node deployments, it works in the active/backup mode.

This section describes how to verify if the svc-monitor is flapping.

To verify that svc-monitor is not flapping:

  1. Log in to the Salt Master node.

  2. Run the following command every 20 seconds several times until the contrail-svc-monitor status is active on one ntw node and backup on the other ntw nodes:

    salt 'ntw0*' cmd.run 'contrail-status -d | grep monitor'
    

    Example of system response:

    ntw01.mcp11-opencontrail.local:
    contrail-svc-monitor          backup     pid 28442, uptime 7 days, 5:45:53
    ntw03.mcp11-opencontrail.local:
    contrail-svc-monitor          backup     pid 31439, uptime 7 days, 5:46:30
    ntw02.mcp11-opencontrail.local:
    contrail-svc-monitor          active     pid 7069, uptime 7 days, 5:46:35
    

If the contrail-svc-monitor service stays in the active state on the same ntw node during the tests and the uptime values show that the processes have not been restarted, the contrail-svc-monitor is not flapping. Otherwise, proceed to Verify the default route in VRF.
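
To repeat this check automatically, you can, for example, wrap the command in watch:

watch -n 20 "salt 'ntw0*' cmd.run 'contrail-status -d | grep monitor'"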

Verify route target references

Updating OpenContrail from version 3.1.1 to version 3.2.3 can cause inconsistencies in the OpenContrail configuration database. These inconsistencies may cause connectivity issues from or to a VM, between VMs, and so on. Verifying route target references may help to resolve such issues.

To verify route target references:

  1. Log in to any Mirantis OpenContrail controller ntw node.

  2. Start contrail-api-cli. For details, see Use the OpenContrail API client.

  3. Run the following command using the OpenContrail controller node VIP. For example, 10.167.4.20:

    contrail-api-cli --host 10.167.4.20 --port 9100 shell
    
  4. Verify whether route-target exists in more than one project:

    10.167.4.21:/> cat route-target/* | jq  -c 'if(.routing_instance_back_refs[0]? ) \
    then if ([.routing_instance_back_refs[].to[1]] | unique | length)>1 \
    then [.display_name,[.routing_instance_back_refs[].to[1]]] | \
    unique else empty end else empty end'
    

    Example of system response:

    ["target:64512:8000001",["TestingTenant","admin"]]
    ["target:64512:8000002",["admin","demo-jdc"]]
    

    In the output, the list of suspected route-targets is displayed with names of referenced projects.

  5. Identify the UUID of the suspected route-target using its name. For example:

    10.167.4.21:/> ls route-target -l | grep target:64512:8000002
    

    Example of system response:

    route-target/8ed6241c-a1e7-4abf-85c0-3afc70a0b3f0  target:64512:8000002
    
  6. Verify the status of route-target using the UUID of the suspected route-target. For example:

    10.167.4.21:/> cat route-target/8ed6241c-a1e7-4abf-85c0-3afc70a0b3f0
    

    Example of system response:

    {
    "display_name": "target:64512:8000002",
    "fq_name": [
      "target:64512:8000002"
    ],
    ...
    },
    "routing_instance_back_refs": [
      {
        "attr": {
          "import_export": null
        },
        "href": "http://10.167.4.21:9100/routing-instance/cb6a97c8-aca9-4dd1-af00-84368c46d784",
        "to": [
          "default-domain",
          "admin",
          "admin-net2",
          "admin-net2"
        ],
        "uuid": "cb6a97c8-aca9-4dd1-af00-84368c46d784"
      },
      {
        "attr": {
          "import_export": null
        },
        "href": "http://10.167.4.21:9100/routing-instance/4db24ca5-7f10-42b8-b7dc-7c2d12a9d6a3",
        "to": [
          "TestDomain",
          "demo-jdc",
          "jdc-net03",
          "jdc-net03"
        ],
        "uuid": "4db24ca5-7f10-42b8-b7dc-7c2d12a9d6a3"
      }
    ],
    "uuid": "8ed6241c-a1e7-4abf-85c0-3afc70a0b3f0"
    }
    

    In the output above, the routing_instance_back_refs section contains back references (the to fields) to routing instances in two different projects. If you know which back reference is incorrect, delete it as described below.

To delete a back reference:

Select from the following options:

  • Edit the corresponding JSON records directly.

  • Use the ln command:

    1. Exit from contrail-api-cli.

    2. Start contrail-api-cli with the --schema-version parameter. For more information, see Use the OpenContrail API client.

    3. Remove the back reference of route-target. For example:

      10.167.4.21:/> ln -r route-target/8ed6241c-a1e7-4abf-85c0-3afc70a0b3f0 \
       routing-instance/cb6a97c8-aca9-4dd1-af00-84368c46d784
      
    4. Verify route-target again using the procedure above.

Verify ifmap-server

OpenContrail uses the standard Interface for Metadata Access Point (IF-MAP) mechanism for the configuration data distribution among OpenContrail configuration and control nodes.

To verify ifmap-server:

  1. Log in to any OpenContrail controller node.

  2. Verify that the virtual machine interface (VMI) is present in ifmap.

    1. Select from the following options:

      • Run the following command:

        ifmap-view visual visual 2>&1  | grep "contrail:virtual-machine-interface"
        

        Example of system response:

        'contrail:virtual-machine-interface:default-domain:TestingTenant:071b9c54-25ed-41cf-ab89-4f60b60e0e66',
        'contrail:virtual-machine-interface:default-domain:TestingTenant:7c05158a-5c4b-458c-bb88-e6de2146856e',
        'contrail:virtual-machine-interface:default-domain:TestingTenant:3187786b-6f11-4c3b-9d0d-3ce9d02de4cf',
        'contrail:virtual-machine-interface:default-domain:TestingTenant:3187786b-6f11-4c3b-9d0d-3ce9d02de4cf',
        
      • Verify the virtual machine interfaces directly by ID. For example:

        ifmap-view visual visual 2>&1  | grep 071b9c54-25ed-41cf-ab89-4f60b60e0e66
        

        Example of system response:

        'contrail:virtual-machine-interface:default-domain:TestingTenant:071b9c54-25ed-41cf-ab89-4f60b60e0e66'
        
    2. If the record of the virtual machine interface in question is missing on the OpenContrail controller node, restart the ifmap-server and contrail-config-nodemgr services on this node:

      service ifmap-server restart
      service supervisor-config restart
      
    3. Verify the number of result items:

      ifmap-view 10.10.10.201 visual visual  2>&1 | grep <virtual_machine>
      

      Example of system response:

      INFO: Number of result items for search = 72
      INFO: Properties on identifier = ['{http://www.contrailsystems.com/vnc_cfg.xsd}id-perms']
      INFO: Links from identifier = ['contrail:domain:default-domain',
      

      The number of result items should be the same on all OpenContrail controller nodes.

    4. If the nodes have a different number of result items, restart the ifmap-server and contrail-config-nodemgr services on these nodes:

      service ifmap-server restart
      service supervisor-config restart
      
  3. Verify the authentication parameters defined in /etc/ifmap-server/basicauthusers.properties. Each IF-MAP client requires a unique user name. All OpenContrail controller nodes must have unique IF-MAP client IDs.

    Example:

    test:test
    test2:test2
    test3:test3
    api-server:api-server
    schema-transformer:schema-transformer
    svc-monitor:svc-monitor
    control-user:control-user-passwd
    control-node-1:control-node-1
    control-node-2:control-node-2
    control-node-3:control-node-3
    dhcp:dhcp
    visual:visual
    sensor:sensor
    
  4. Verify the ifmap-server logs in the following directories:

    • /var/log/contrail/ifmap-server.log
    • /var/log/contrail/ifmap-stdout.log
  5. Verify the introspect of the IF-MAP server over HTTP on the port 8083. Use the following introspect address format: http://<ntw0x_ip_address>:8083/ifmap_server_show.xml.

    1. In the IFMapPeerServerInfoReq section, click Send or open the following address in a browser: http://<ntw0x_ip_address>:8083/Snh_IFMapPeerServerInfoReq.

      The following screen shows an example of the IF-MAP server information:

      _images/ifmap_introspect_peers.png
    2. In the ds_peer_info and ds_peer_list sections, verify the information about all IF-MAP peers. In the example above, two OpenContrail controller nodes are displayed. The value of one peer node must be true in the In_use column.

  6. Repeat the steps above on the remaining OpenContrail controller nodes.

Troubleshoot the vMX router

MCP uses OpenContrail with vMX routers in its cloud deployments as they provide a rich set of features particularly beneficial for NFV use cases and allow easy network scale-out.

Note

  • An AS (Autonomous System) number is a 2-byte or 4-byte identifier for a network segment or organization
  • OpenContrail supports only 2-byte AS numbers (1-65534)
  • Typically, private AS numbers (64512-65534) are used
  • The AS number must be the same on the vMX router and the OpenContrail controllers
  • The vMX uplink peers might be in different ASNs

Warning

In this section, vSRX is used instead of vMX, but the process is the same for both.

To troubleshoot the vMX router:

  1. Log in to a Mirantis OpenContrail controller ntw node.

  2. Verify the BGP routers configuration using the Introspect section in the web UI. The web UI is accessible directly or through HAProxy on port 9100.

    curl http://control01:8082/bgp-routers
    

    The command above returns a list of all routers defined in the OpenContrail cluster.

    Example of system response:

    {
      "bgp-routers": [
        {
          "uuid": "443af522-2463-4960-a2b3-77b6b6a46fef",
          "fq_name": [
            "default-domain",
            "default-project",
            "ip-fabric",
            "__default__",
            "ntw03"
          ],
          "href": "http://10.167.4.20:8082/bgp-router/443af522-2463-4960-a2b3-77b6b6a46fef"
        },
        ...
        {
          "uuid": "8af298c2-2a12-452b-86ce-54bff3ce2d9d",
          "fq_name": [
            "default-domain",
            "default-project",
            "ip-fabric",
            "__default__",
            "vsrx1"
          ],
          "href": "http://10.167.4.20:8082/bgp-router/8af298c2-2a12-452b-86ce-54bff3ce2d9d"
        }
      ]
    }
    
  3. Display the detailed information about the BGP router using the URL from the command above:

    curl http://10.167.4.20:8082/bgp-router/443af522-2463-4960-a2b3-77b6b6a46fef
    

    Example of system response:

    {
      "bgp-router": {
        "name": "vsrx1",
        ...
        "fq_name": [
          "default-domain",
          "default-project",
          "ip-fabric",
          "__default__",
          "vsrx1"
        ],
        "bgp_router_refs": [
          {
            "uuid": "fb39d3e8-6be0-4e69-b13d-bd69c1685c6c",
            "attr": {
              "session": ....
            },
            "href": "http://10.167.4.22:9100/bgp-router/fb39d3e8-6be0-4e69-b13d-bd69c1685c6c",
            "to": [
              "default-domain",
              "default-project",
              "ip-fabric",
              "__default__",
              "ntw03"
            ]
          },
          {
            "uuid": "426affc9-b05c-47d8-b0ba-ffe72e59d984",
            "attr": {
              "session": .....
            },
            "href": "http://10.167.4.22:9100/bgp-router/426affc9-b05c-47d8-b0ba-ffe72e59d984",
            "to": [
              "default-domain",
              "default-project",
              "ip-fabric",
              "__default__",
              "ntw02"
            ]
          },
          {
            "uuid": "50f1a77e-9807-4024-b889-f771f2b97835",
            "attr": {
              "session": .....
            },
            "href": "http://10.167.4.22:9100/bgp-router/50f1a77e-9807-4024-b889-f771f2b97835",
            "to": [
              "default-domain",
              "default-project",
              "ip-fabric",
              "__default__",
              "ntw01"
            ]
          }
        ],
        "display_name": "vsrx1",
        "uuid": "2097b2c0-65ac-4c2f-ab0b-aaef2bf9e95a",
        "parent_uuid": "8356722f-02d9-4a57-baaf-1f5013e263f5",
        "parent_href": "http://10.167.4.22:9100/routing-instance/8356722f-02d9-4a57-baaf-1f5013e263f5",
        "parent_type": "routing-instance",
        "bgp_router_parameters": {
          "address_families": {
            "family": [
              "route-target",
              "inet-vpn",
              "e-vpn",
              "inet6-vpn"
            ]
          },
          "autonomous_system": 64512,
          "hold_time": 0,
          "identifier": "10.167.4.100",
          "router_type": "router",
          "source_port": null,
          "gateway_address": null,
          "vendor": "mx",
          "admin_down": false,
          "ipv6_gateway_address": null,
          "port": 179,
          "local_autonomous_system": null,
          "auth_data": null,
          "address": "10.167.4.100"
        },
        "perms2": {
          "share": [],
          "global_access": 0,
          "owner_access": 7,
          "owner": "cloud-admin"
        },
        "href": "http://10.167.4.22:9100/bgp-router/2097b2c0-65ac-4c2f-ab0b-aaef2bf9e95a"
      }
    }
    

    In the output above, you can verify such important parameters as autonomous_system, vendor, and others.

  4. Log in to the vMX/vSRX router.

  5. Verify the peer BGP routers and the AS number:

    root@vsrx1% cli
    root@vsrx1> show bgp summary
    

    Example of system response:

    Groups: 1 Peers: 3 Down peers: 0
    Unconfigured peers: 3
    Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
    bgp.l3vpn.0           54         27          0          0          0          0
    Peer               AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
    10.167.4.21           64512      66176      66588       0       0     3w1d23h Establ
      bgp.l3vpn.0: 0/0/0/0
    10.167.4.22           64512     117437     100892       0       0     4w6d20h Establ
      bgp.l3vpn.0: 27/27/27/0
      public.inet.0: 5/5/5/0
    10.167.4.23           64512      85912      69311       0       0     3w2d21h Establ
      bgp.l3vpn.0: 0/27/27/0
      public.inet.0: 0/5/5/0
    
  6. View the current configuration:

    root@vsrx1> show configuration routing-options
    

    Example of system response:

    route-distinguisher-id 10.109.3.250;
    autonomous-system 64512;
    dynamic-tunnels {
        dynamic_overlay_tunnels {
            source-address 10.167.4.100;
            gre;
            destination-networks {
                10.109.3.0/24;
                172.16.10.0/24;
                10.167.4.0/24;
            }
        }
    }
    

    Note

    The command above returns the current configuration. To view the latest changes, use compare rollback:

    root@vsrx1> show configuration | compare rollback 1
    
    -    source-address 10.167.4.20;
    +    source-address 10.167.4.100;
    

    The number after rollback signifies the number of the commit against which to compare the current configuration.

  7. If you have a BGP peer down error with incorrect family:

    1. Verify the BGP peer UVE:

      curl http://nal01:9081/analytics/uves/bgp-peers
      

      User Visible Entities are OpenContrail resources, such as virtual network, virtual machines, vrouter, routing-instances, and so on. UVE APIs are used to query these resources.

  8. Search for the vMX/vSRX BGP peer by name in the list.

    In the UVE output, families is the list of address families advertised by the peer and configured_families is what is provisioned. If the families configured on the peer have a mismatch, the peer does not move to an established state. You can verify this in the peer UVE.

  9. Fix the families mismatch in the sample by updating the configuration on the MX Series router, using Junos CLI:

    set protocols bgp group contrail-control-nodes family inet-vpn unicast
    
  4. After you commit the configuration, the peer comes up. Verify it using the peer UVE.

  5. Verify the peer status on the MX router using the Junos CLI:

    run show bgp neighbor 10.167.4.21
    
  6. Check the route on the vMX/vSRX router:

    Use Junos CLI show commands from the router to check the route.

    root@vsrx1> show route table public.inet.0
    
    public.inet.0: 8 destinations, 13 routes (8 active, 0 holddown, 0 hidden)
    + = Active Route, - = Last Active, * = Both
    
    0.0.0.0/0          *[Static/5] 4w6d 22:49:22
                        > to 172.17.32.193 via ge-0/0/0.0
    172.17.32.192/26   *[Direct/0] 4w6d 22:49:22
                        > via ge-0/0/0.0
    172.17.32.240/32   *[Local/0] 4w6d 22:49:22
                          Local via ge-0/0/0.0
    <floating_ip0>/32  *[BGP/170] 4w3d 00:41:07, MED 100, localpref 200, from 10.167.4.22
                          AS path: ?
                        > via gr-0/0/0.32769, Push 40
                        [BGP/170] 3w3d 00:30:48, MED 100, localpref 200, from 10.167.4.23
                          AS path: ?
                        > via gr-0/0/0.32769, Push 40
    <floating_ip1>/32  *[BGP/170] 3w3d 00:28:16, MED 100, localpref 200, from 10.167.4.22
                          AS path: ?
                        > via gr-0/0/0.32770, Push 19
                        [BGP/170] 3w3d 00:30:48, MED 100, localpref 200, from 10.167.4.23
                          AS path: ?
                        > via gr-0/0/0.32770, Push 19
    <floating_ip2>/32  *[BGP/170] 4w5d 23:22:58, MED 100, localpref 200, from 10.167.4.22
                          AS path: ?
                        > via gr-0/0/0.32769, Push 29
                        [BGP/170] 3w3d 00:30:48, MED 100, localpref 200, from 10.167.4.23
                          AS path: ?
                        > via gr-0/0/0.32769, Push 29
    <floating_ip3>/32  *[BGP/170] 2d 01:50:04, MED 100, localpref 200, from 10.167.4.22
                          AS path: ?
                        > via gr-0/0/0.32770, Push 37
                        [BGP/170] 2d 01:50:04, MED 100, localpref 200, from 10.167.4.23
                          AS path: ?
                        > via gr-0/0/0.32770, Push 37
    <floating_ip4>/32  *[BGP/170] 2d 01:31:11, MED 100, localpref 200, from 10.167.4.22
                          AS path: ?
                        > via gr-0/0/0.32770, Push 39
                        [BGP/170] 2d 01:31:11, MED 100, localpref 200, from 10.167.4.23
                          AS path: ?
                        > via gr-0/0/0.32770, Push 39
    

    In the output above, you can find the floating IP address that you want to debug.

  7. To view the detailed output, run:

    root@vsrx1> show route 172.17.35.8/32 detail
    
  8. Proceed to Troubleshoot a VM forward and reverse flow.
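
When you search the UVE list for the vMX/vSRX peer (step 2 above), you can filter the analytics API response instead of reading it manually. The following is only a minimal sketch: it assumes python is available on the node where you run curl, and <ROUTER_NAME> is a placeholder for your gateway router hostname:

    curl -s http://nal01:9081/analytics/uves/bgp-peers | python -m json.tool | grep <ROUTER_NAME>

Each matching entry contains an href that you can query with the same curl command to inspect the families and configured_families attributes of that peer.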

Troubleshoot a VM forward and reverse flow

This section describes how to obtain the forward and reverse flow records of VMs from their respective vRouter compute nodes. This information helps you troubleshoot communication issues between virtual machines.

The following example describes communication between two virtual machines, VM_1 and VM_2, that reside on different compute nodes:

The following table lists the information that is initially known about the flows described in this section:

Value name       VM_1               VM_2
prefix local     192.168.0.5/32     192.168.0.100/32
prefix remote    192.168.0.100/32   192.168.0.5/32

To show flows between VMs:

Run the following command on the compute nodes that host the VMs:

  • For compute 002:

    root@cmp002:~# flow -l | grep '192.168.0.5\|192.168.0.100'
    

    Example of system response:

     Index                Source:Port/Destination:Port                      Proto(V)
    -----------------------------------------------------------------------------------
    492152<=>1500364      192.168.0.5:792                                     1 (5)
                          192.168.0.100:0
    (Gen: 1, K(nh):83, Action:F, Flags:, QOS:-1, S(nh):83,  Stats:487/519142,
    
    
    1500364<=>492152      192.168.0.100:792                                   1 (5)
                          192.168.0.5:0
    (Gen: 1, K(nh):83, Action:F, Flags:, QOS:-1, S(nh):35,  Stats:487/519142,
    
  • For compute 003:

    root@cmp003:~# flow -l | grep '192.168.0.5\|192.168.0.100'
    # OR
    root@cmp003:~# flow --match "192.168.0.5,192.168.0.100"
    

    Example of system response:

     Index                Source:Port/Destination:Port                      Proto(V)
    -----------------------------------------------------------------------------------
    1274292<=>1451388      192.168.0.5:792                                     1 (1)
                           192.168.0.100:0
    (Gen: 15, K(nh):98, Action:F, Flags:, QOS:-1, S(nh):63,  Stats:983/1047878,
    
    1451388<=>1274292      192.168.0.100:792                                   1 (1)
                           192.168.0.5:0
    (Gen: 9, K(nh):98, Action:F, Flags:, QOS:-1, S(nh):98,  Stats:983/1047878,
    

In the output, you can see the ICMP request and response.

Pay attention to the following important parts of the output:

Action
If you get Action: D(Sg), there might be an issue in the security groups.
K(nh)
The next hop assigned to the prefix; it represents the nh-id of the flow.
S(nh)
The next hop used for RPF checks. When a flow is set up, the agent performs a route lookup for the source IP address and sets the rpf-nh in the flow. The type of next hop used depends on the route matched for the source IP address.
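
To quickly spot flows that are being dropped on a compute node, you can filter the flow listing by the Action field. This is only a convenience sketch that relies on the output format shown above:

    root@cmp002:~# flow -l | grep -B2 'Action:D'

The -B2 option also prints the two preceding lines with the flow index and both endpoints, so you can see which flow is affected.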

The following table describes the abbreviations used in the flow output:

Parameter Abbreviation description
Action
  • F=Forward
  • D=Drop
  • N=NAT (S=SNAT, D=DNAT, Ps=SPAT, Pd=DPAT, L=Link Local Port)
Other
  • K(nh)=Key_Nexthop
  • S(nh)=RPF_Nexthop
Flags
  • E=Evicted
  • Ec=Evict Candidate
  • N=New Flow
  • M=Modified
  • Dm=Delete Marked
TCP(r=reverse)
  • S=SYN
  • F=FIN
  • R=RST
  • C=HalfClose
  • E=Established
  • D=Dead
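
To confirm that the forward and reverse flows are actively passing traffic, you can watch the Stats counters grow over time. The following sketch uses the standard Linux watch utility together with the flow --match command shown above:

    root@cmp003:~# watch -d -n 1 'flow --match "192.168.0.5,192.168.0.100"'

The -d option highlights the values that change between refreshes, which makes increasing packet and byte counters easy to spot.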

At this point, we have the following information about the flows described in this section:

Value name       VM_1               VM_2
prefix local     192.168.0.5/32     192.168.0.100/32
prefix remote    192.168.0.100/32   192.168.0.5/32
nh local         83                 98
nh remote        35                 63

To get VRF ID:

Run the following commands:

  • For compute 002:

    root@cmp002:~# nh --get 83
    
    Id:83         Type:Encap          Fmly: AF_INET  Rid:0  Ref_cnt:4          Vrf:5
                  Flags:Valid, Policy,
                  EncapFmly:0806 Oif:8 Len:14
                  Encap Data: 02 29 64 b3 e2 f4 00 00 5e 00 01 00 08 00
    
    root@cmp002:~# nh --get 35
    
    Id:35         Type:Tunnel         Fmly: AF_INET  Rid:0  Ref_cnt:4850       Vrf:0
                  Flags:Valid, MPLSoUDP,
                  Oif:0 Len:14 Flags Valid, MPLSoUDP,  Data:0c c4 7a 50 27 88 0c c4 7a 17 99 5d 08 00
                  Vrf:0  Sip:10.167.4.102  Dip:10.167.4.103
    
  • For compute 003:

    root@cmp003:~# nh --get 98
    
    Id:98         Type:Encap          Fmly: AF_INET  Rid:0  Ref_cnt:5          Vrf:1
                  Flags:Valid, Policy,
                  EncapFmly:0806 Oif:13 Len:14
                  Encap Data: 02 d1 32 42 b5 87 00 00 5e 00 01 00 08 00
    
    root@cmp003:~# nh --get 63
    
    Id:63         Type:Tunnel         Fmly: AF_INET  Rid:0  Ref_cnt:20         Vrf:0
                  Flags:Valid, MPLSoUDP,
                  Oif:0 Len:14 Flags Valid, MPLSoUDP,  Data:0c c4 7a 17 99 5d 0c c4 7a 50 27 88 08 00
                  Vrf:0  Sip:10.167.4.103  Dip:10.167.4.102
    

Using the outputs above, we can add the following information about the flows described in this section:

Value name       VM_1               VM_2
prefix local     192.168.0.5/32     192.168.0.100/32
prefix remote    192.168.0.100/32   192.168.0.5/32
nh local         83                 98
nh remote        35                 63
VRF local        5                  1
VRF remote       0                  0
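
If you need to go in the opposite direction, from an interface to the next hops that point at it, you can list all next hops on a compute node and filter by the output interface index. This is only a sketch, assuming that nh --list is available in your version of the vRouter utilities; Oif:8 is the interface index taken from the nh --get 83 output above:

    root@cmp002:~# nh --list | grep -B2 'Oif:8 '

The -B2 option also prints the Id and Flags lines of each matching next hop.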

To identify the tap interfaces used by the VMs:

  1. Log in to an OpenStack controller node.

  2. Identify the tap interfaces used by the VMs:

    neutron port-list | grep 192.168.0.5 | awk '{print $2}' | cut -c 1-11 | awk '{print "tap"$1}'
    
    tap2964b3e2-f4
    
    neutron port-list | grep 192.168.0.100 | awk '{print $2}' | cut -c 1-11 | awk '{print "tap"$1}'
    
    tapd13242b5-87
    

    You can also get this information from the cmp compute node using the Oif value of the nh --get output. For example:

    nh --get 83
    

    Example of system response:

    Id:83         Type:Encap          Fmly: AF_INET  Rid:0  Ref_cnt:4          Vrf:5
                  Flags:Valid, Policy,
                  EncapFmly:0806 Oif:8 Len:14
                  Encap Data: 02 29 64 b3 e2 f4 00 00 5e 00 01 00 08 00
    
    vif --get 8
    

    Example of system response:

    Vrouter Interface Table
    
    Flags: P=Policy, X=Cross Connect, S=Service Chain, Mr=Receive Mirror
           Mt=Transmit Mirror, Tc=Transmit Checksum Offload, L3=Layer 3, L2=Layer 2
           D=DHCP, Vp=Vhost Physical, Pr=Promiscuous, Vnt=Native Vlan Tagged
           Mnp=No MAC Proxy, Dpdk=DPDK PMD Interface, Rfl=Receive Filtering Offload, Mon=Interface is Monitored
           Uuf=Unknown Unicast Flood, Vof=VLAN insert/strip offload, Df=Drop New Flows, Proxy=MAC Requests Proxied Always
    
    vif0/8      OS: tap2964b3e2-f4
                Type:Virtual HWaddr:00:00:5e:00:01:00 IPaddr:0
                Vrf:5 Flags:PL3L2D QOS:-1 Ref:5
                RX packets:7221  bytes:5897180 errors:0
                TX packets:7229  bytes:6023816 errors:0
                Drops:15
    

    You can compare the Vrf parameter of this output with the output for Id:83.
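
    Conversely, if you already know the tap interface name, you can look up its vif index and Vrf directly with vif --list, the listing counterpart of the vif --get command used above. For example, for VM_1:

    vif --list | grep -A2 tap2964b3e2-f4

    The -A2 option also prints the Type and Vrf lines of the matching interface.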

Using the outputs above, we can add the following information about the flows described in this section:

Value name       VM_1               VM_2
prefix local     192.168.0.5/32     192.168.0.100/32
prefix remote    192.168.0.100/32   192.168.0.5/32
nh local         83                 98
nh remote        35                 63
VRF local        5                  1
VRF remote       0                  0
Tap Interface    tap2964b3e2-f4     tapd13242b5-87
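
As an additional sanity check, you can confirm that each tap interface exists and is UP on its compute node. This sketch uses standard iproute2 commands and the interface names from the table above, and assumes the kernel vRouter, where tap interfaces are visible as regular Linux devices:

    root@cmp002:~# ip link show tap2964b3e2-f4
    root@cmp003:~# ip link show tapd13242b5-87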

To verify the routing table of a VRF (routing instance):

Show or dump the routing table of the required VRF using the rt command:

  • For compute 002:

    root@cmp002:~# rt --dump 5 | grep 192.168.0.5/32
    192.168.0.5/32         32            P          -       83        2:29:64:b3:e2:f4(240568)
    
    root@cmp002:~# rt --dump 5 | grep 192.168.0.100/32
    192.168.0.100/32       32           LP         44       35        2:d1:32:42:b5:87(149684)
    
  • For compute 003:

    root@cmp003:~# rt --dump 1 | grep 192.168.0.5/32
    192.168.0.5/32         32           LP         33       63        2:29:64:b3:e2:f4(159964)
    
    root@cmp003:~# rt --dump 1 | grep 192.168.0.100/32
    192.168.0.100/32       32            P          -       98        2:d1:32:42:b5:87(227736)
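
To check both prefixes of a VRF in a single pass, you can combine the two lookups above into one grep. This is only a minor convenience; the VRF IDs and prefixes are the ones used throughout this example:

    root@cmp002:~# rt --dump 5 | grep -E '192\.168\.0\.(5|100)/32'
    root@cmp003:~# rt --dump 1 | grep -E '192\.168\.0\.(5|100)/32'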
    

If the above procedure does not resolve the issues, proceed with the following steps:

  • Verify that a network policy is created.
  • Verify that a network policy is assigned to VMs.
  • Verify the rules of a security group.
  • Verify that virtual networks (VNs) are assigned a correct security group.
  • Verify the routing table using rt --dump <VRF_ID> to check whether the prefixes were exchanged.
  • Verify the MTU on the physical interface used by the vRouter.
  • Verify the MTU for jumbo frames in the underlay network.
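
For the last two items in the list above, a quick way to check the MTU is to inspect the physical interface and then send do-not-fragment pings across the underlay. This is a sketch that assumes a 9000-byte jumbo MTU and uses bond0 as an example name for the vRouter physical interface; replace both with your actual values:

    # Check the configured MTU of the vRouter physical interface
    root@cmp002:~# ip link show bond0 | grep mtu

    # 8972 bytes of ICMP payload + 28 bytes of headers = 9000-byte packets;
    # -M do forbids fragmentation, 10.167.4.103 is the underlay IP of cmp003
    root@cmp002:~# ping -M do -s 8972 -c 3 10.167.4.103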