
MCP Operations Guide

Preface

This documentation provides information on how to use Mirantis products to deploy cloud environments. The information is for reference purposes and is subject to change.

Intended audience

This documentation is intended for deployment engineers, system administrators, and developers; it assumes that the reader is already familiar with network and cloud concepts.

Documentation history

The following table lists the released revisions of this documentation:

Revision date       Description

February 8, 2019    Q4`18 GA

Introduction

This guide outlines the post-deployment Day-2 operations for an MCP cloud. It describes how to configure and manage the MCP components, perform different types of cloud verification, and enable additional features depending on your cloud needs. The guide also contains day-to-day maintenance procedures such as how to back up and restore, update and upgrade, or troubleshoot an MCP cluster.

Provision hardware

MCP uses Ubuntu's Metal-as-a-Service (MAAS) to provision hardware for an MCP deployment.

MAAS, as a bare metal provisioning service, requires an IPMI user to manage the power state of the managed nodes. This user should be configured as part of the installation process.
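
For example, a minimal sketch of creating such a user with ipmitool, run locally on a node (the user ID slot 3, the maas user name, and the password placeholder are illustrative; adjust them to your hardware):

# Create an IPMI user that MAAS will use for power management
ipmitool user set name 3 maas
ipmitool user set password 3 <IPMI_PASSWORD>
# Grant ADMINISTRATOR (level 4) privileges on LAN channel 1
ipmitool user priv 3 4 1
ipmitool user enable 3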

MAAS provides DHCP to the network(s) on which compute nodes reside. Compute nodes then perform PXE boot from the network and you can configure MAAS to provision these PXE booted nodes.

Reinstall MAAS

If your MAAS instance is lost or broken, you can reinstall it. This section describes how to install MAAS from the Ubuntu Server 16.04 qcow2 image.

To reinstall MAAS:

  1. Create a cloud-config disk:

    1. For example, create a configuration file named config-drive.yaml:

      #cloud-config
      debug: True
      ssh_pwauth: True
      disable_root: false
      chpasswd:
        list: |
          root:r00tme
          ubuntu:r00tme
        expire: False
      

      Note

      You must change the default password.

    2. Create the configuration drive:

      export VM_CONFIG_DISK="/var/lib/libvirt/images/maas/maas-config.iso"
      cloud-localds  --hostname maas01 --dsmode local ${VM_CONFIG_DISK}  config-drive.yaml
      
  2. Create a VM system disk using the preloaded qcow2 image. For example:

    wget http://cloud-images.ubuntu.com/releases/xenial/release-20180306/ubuntu-16.04-server-cloudimg-amd64-disk1.img -O \
    /var/lib/libvirt/images/maas/maas-system-backend.qcow2
    
    export VM_SOURCE_DISK="/var/lib/libvirt/images/maas/maas-system.qcow2"
    qemu-img create -b /var/lib/libvirt/images/maas/maas-system-backend.qcow2 -f qcow2 ${VM_SOURCE_DISK} 100G
    
  3. Create a VM using the predefine-vm script. For example:

    export MCP_VERSION="master"
    wget https://github.com/Mirantis/mcp-common-scripts/blob/${MCP_VERSION}/predefine-vm/define-cfg01-vm.sh -O define-vm.sh
    
    chmod 0755 define-vm.sh
    export VM_NAME="maas01.[CLUSTER_DOMAIN]"
    

    Note

    You may add other optional variables that have default values and set them according to your deployment configuration. These variables include:

    • VM_MGM_BRIDGE_NAME="br-mgm"

    • VM_CTL_BRIDGE_NAME="br-ctl"

    • VM_MEM_KB="8388608"

    • VM_CPUS="4"


    The br-mgm and br-ctl values are the names of the Linux bridges. See MCP Deployment Guide: Prerequisites to deploying MCP DriveTrain for details. Custom names can be passed to a VM definition using the VM_MGM_BRIDGE_NAME and VM_CTL_BRIDGE_NAME variables accordingly.

  4. Boot the created VM.

  5. Log in to the VM using the previously defined password.

  6. Proceed with installing MAAS:

    sudo apt-get install maas
    
  7. Configure MAAS as required to complete the installation.

  8. Verify the installation by opening the MAAS web UI:

    http://<MAAS-IP-ADDRESS>:5240/MAAS
    
  9. If you have installed MAAS from the packages, create an initial (administrative) user first to log in to the MAAS web UI:

    sudo maas createadmin --username=<PROFILE> --email=<EMAIL_ADDRESS>
    

Add an SSH key

To simplify access to provisioned nodes, add an SSH key that MAAS will deliver when provisioning these nodes.

To add an SSH key:

  1. In the MAAS web UI, open the user page by clicking on your user name at the top-right corner.

  2. Find the SSH keys section.

  3. From the Source drop-down menu, specify the source of an SSH key using one of the options:

    • Launchpad, specifying your Launchpad ID

    • Github, specifying your Github ID

    • Upload, placing an existing public key to the Public key edit box

  4. Click Import.

Add a boot image

You can select images with appropriate CPU architectures that MAAS will import, regularly sync, and deploy to the managed nodes.

To add a new boot image:

  1. In the MAAS web UI, open the Images page.

  2. Select the releases you want to make available, as well as the required architectures.

  3. Click Save selection.

Add a subnet

You can add new networking elements in MAAS such as fabrics, VLANs, subnets, and spaces. MAAS should detect new network elements automatically. Otherwise, you can add them manually.

To add a new subnet:

  1. In the MAAS web UI, open the Subnets page.

  2. Select Subnet in the Add drop-down menu at the top-right corner.

  3. Specify Name, CIDR, Gateway IP, DNS servers, Fabric & VLAN, and Space.

  4. Click Add subnet.

See also

MAAS Networking

Enable DHCP on a VLAN

Before enabling DHCP, ensure that the MAAS node network interface is properly configured to listen on the VLAN.

You can use external DHCP or enable MAAS-managed DHCP. Using an external DHCP server for enlistment and commissioning may work but is not supported. High availability also depends upon MAAS-managed DHCP.

To enable MAAS-managed DHCP on a VLAN:

  1. In the MAAS web UI, open the Subnets page.

  2. Click on the VLAN you want to enable DHCP on.

  3. In the VLAN configuration panel, verify that DHCP is currently shown as Disabled.

  4. From the Take action drop-down menu at the top-right corner, select the Provide dhcp item.

  5. In the Provide DHCP panel, verify or change the settings for Rack controller, Subnet, Dynamic range start IP, Dynamic range end IP.

  6. Click Provide dhcp.

  7. In the VLAN configuration panel, verify that DHCP is enabled.

Enable device discovery

MAAS provides passive and active methods of device discovery.

Passive methods include:

  • Listening to ARP requests

  • DNS advertisements

To enable passive device discovery:

  1. In the MAAS web UI, open the MAAS dashboard.

  2. On the MAAS dashboard, turn on the Discovery enabled switch.

  3. Verify if you can see the discovered devices on the MAAS dashboard.

Active methods include active subnet mapping, which forces MAAS to periodically scan all subnets that have active mapping enabled, using the configured active subnet mapping interval.

To enable active subnet mapping:

  1. In the MAAS web UI, open the Settings page.

  2. Go to the Device Discovery section.

  3. From the drop-down menu, select the value for Active subnet mapping interval.

  4. Open the Subnets page.

  5. Click the subnet you want to enable active mapping on.

  6. In the Subnet summary section, turn on the Active mapping switch.

See also

Device discovery

Add a new node

Using MAAS, you can add new nodes in an unattended way, called enlistment, or manually when enlistment does not work.

MAAS enlistment uses a combination of DHCP with TFTP and PXE technologies.

Note

  • To boot a node over PXE, enable netboot or network boot in BIOS.

  • For KVM virtual machines, specify the boot device as network in the VM configuration file and add the node manually. You need to configure the Virsh power type and provide access to the KVM host as described in BMC Power Types.

  • To ignore the already deployed machines and avoid issues with the wait_for_ready Salt state failure, you may need to set ignore_deployed_machines to true in the Reclass model:

    parameters:
      maas:
        region:
          ignore_deployed_machines: true
    

To add a new node manually:

  1. In the MAAS web UI, open the Nodes page.

  2. Click the Add hardware drop-down menu at the top-right corner and select Add machine.

  3. Set Machine name, Domain, Architecture, Minimum Kernel, Zone, MAC Address, Power type.

    Note

    See Configure power management for more details on power types.

  4. Click Save machine.

MAAS will add the new machine to the list of nodes with the status Commissioning.

Configure power management

MAAS supports many types of power control, from standard IPMI to non-standard types such as virsh, VMware, Nova, or even completely manual ones that require operator intervention. While most servers use their own vendor-specific management, for example, iLO or DRAC, standard IPMI controls are also supported, and you can use IPMI as shown in the following example.

To configure IPMI node power management type:

  1. In the MAAS web UI, open the Nodes page.

  2. In the list of nodes, select the one you want to configure.

  3. In the machine configuration window, go to the Power section and click the Edit button at the top-right corner.

  4. Select IPMI for Power type from the drop-down menu.

  5. Specify parameters for the IPMI power type: Power driver, IP address, Power user, Power password, and Power MAC.

  6. Click Save changes.

After saving the changes, MAAS will verify that it can manage the node through IPMI.

See also

BMC Power Types

Commission a new node

When you add a new node, MAAS automatically starts commissioning the node once configuration is done. Also, you can commission a node manually.

To commission a new node manually:

  1. In the MAAS web UI, open the Nodes page.

  2. In the list of nodes, click the node you want to configure.

  3. In the node configuration window, click Take action at the top-right corner and select Commission.

  4. Select additional options by setting the appropriate check boxes: allow SSH access and prevent the machine from powering off, and retain the network and storage configuration if you want to preserve data on the node, for example, if the node comes from an existing cloud with an instance on it.

  5. Click Go to start commissioning.

  6. Verify that the node status has changed to Ready and that the hardware summary in the node configuration window has been filled with values other than zeros, which means the commissioning was successful.

Note

Use MAAS CLI to commission a group of machines.
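
For example, a minimal sketch that commissions every machine currently in the New state (it assumes a logged-in MAAS CLI profile named admin and the jq utility; adjust both to your environment):

maas admin machines read | jq -r '.[] | select(.status_name=="New") | .system_id' | \
while read system_id; do
  # Commission each machine by its system ID
  maas admin machine commission "${system_id}"
done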

Deploy a node

Once a node has been commissioned, you can deploy it.

The deployment operation includes installing an operating system and copying the SSH keys imported to MAAS. As a result, you can access the deployed node through SSH using the default user account ubuntu.

To deploy a node:

  1. In the MAAS web UI, open the Nodes page.

  2. In the list of nodes, verify that the commissioned node is in the Ready state.

  3. Click on the node to open the configuration window.

  4. In the node configuration window, click the Take action button at the top-right corner and select the Deploy item.

  5. Specify the OS, release, and kernel options.

  6. Click Go to start the deployment.

Once the deployment is finished, MAAS will change the node status to Deployed.

Redeploy a node

To redeploy a node:

  1. In the MAAS web UI, open the Nodes page.

  2. In the list of nodes, select the node you want to redeploy.

  3. In the node configuration window, click Take action at the top-right corner and select Release.

  4. Verify that the node status has changed to Ready.

  5. Redeploy the node as described in Deploy a node.

Delete a node

Warning

Deleting a node in MAAS is a permanent operation. The node will be powered down and removed from the MAAS database. All existing configuration done on the node, such as the name, hardware specifications, and power control type, will be permanently lost. The node can be added again later; however, MAAS will not recognize it. In this case, you will need to add the node as a new one and reconfigure it from scratch.

To delete a node:

  1. In the MAAS web UI, open the Nodes page.

  2. In the list of nodes, select the node you want to delete.

  3. In the node configuration window, click the Take action button at the top-right corner and select Delete.

  4. Click Go to confirm the deletion.

SaltStack operations

SaltStack is an orchestration and configuration platform that implements the model-driven architecture (MDA). You can use it to turn service configurations described in Reclass models and Salt Formulas into actual services running on nodes.

Salt Minion operations

Run a command on a node

The Salt integrated cmd.run function is highly flexible: it enables the operator to pass nearly any bash command to a node or group of nodes, and thus functions as a simple batch processing tool.

To run a command on a node, execute:

salt '<NODE_NAME>' cmd.run '<COMMAND>'
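
For example, to check the disk usage on all compute nodes (the target pattern is illustrative and assumes compute node names start with cmp):

salt 'cmp*' cmd.run 'df -h /var'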

List services on a node

The Salt integrated service.get_all function shows available services on the node.

To list services on a node, run:

salt '<NODE_NAME>' service.get_all

Restart a service on a node

You can use the Salt integrated service.restart function to restart services.

To restart a service on a node, run:

salt '<NODE_NAME>' service.restart <SERVICE_NAME>
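
For example, to restart the ntp service on a particular node (the node and service names are illustrative):

salt 'ctl01*' service.restart ntp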

Note

If you do not know the name of the service or are unsure which services are available, see the List services on a node section.

Verify Minions have joined the Master

If you are not sure whether a Minion has joined the Master, verify the output of salt-key. The output of the command lists all known keys in the following states:

  • Accepted

    Nodes in this state have successfully joined the Salt Master

  • Denied

    Nodes in this state have not successfully joined because of a bad, duplicate, or rejected key. Nodes in this state require additional user action to join the Master.

  • Rejected

    Nodes in this state have been explicitly rejected by an administrator.

To verify Minions have joined the Master, run:

salt-key

Example of a system response:

Accepted Keys:
<NODE_NAME>.domain.local
… [snip] ...
Denied Keys:
Unaccepted Keys:
Rejected Keys:
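
If a key appears under Unaccepted Keys, accept it manually; if a key must no longer be trusted, delete it. For example:

salt-key -a '<NODE_NAME>.domain.local'
salt-key -d '<NODE_NAME>.domain.local'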

Ping a Minion from the Master

You can ping all properly running Salt Minion nodes from the Salt Master. To verify that you have network availability between Salt Minion nodes and the Salt Master node, use the test.ping command.

To ping a Minion from the Master:

salt '<NODE_NAME>.domain.local' test.ping

Example of a system response:

<NODE_NAME>.domain.local:
    True

Salt States operations

Salt State is a declarative or imperative representation of a system state.

List available States of a Minion

A Salt Minion node can have different States.

To list available States of a Minion, execute on a node:

salt-call state.show_top

Example of a system response:

local:
 ----------
 base:
     - linux
     - ntp
     - salt
     - heka
     - openssh
     - nova
     - opencontrail
     - ceilometer

Apply a State to a Minion

You can apply changes to a Minion’s State from the Salt Master.

To apply a State to a Minion, run:

salt '<NODE_NAME>' state.sls <STATE_NAME>

Salt Formula operations

Salt Formula is a declarative or imperative representation of a system configuration.

Verify and validate a Salt Formula

You can verify and validate a new Salt Formula before applying it by running a quick test that checks for invalid Jinja, YAML, and Salt state syntax.

To verify a SLS file in a Salt Formula, run:

salt '*' state.show_sls <SLS_FILE_NAME>

To validate the Salt Formula in the test-only (dry-run) mode, use the test option:

salt '*' state.apply test=True

Apply a Salt Formula

This section covers how you can test and apply a Salt Formula.

To apply all configured states (highstate) from a Salt Formula to all Minions, run on the Salt Master:

salt '*' state.apply

Note

This command is equivalent to:

salt '*' state.highstate

To apply individual SLS files in a Salt Formula, run:

salt '*' state.apply <SLS_FILE_1>,<SLS_FILE_2>

Warning

Applying Salt Formulas on more than 100 nodes may result in numerous failures.

Note

SaltStack runs new states in parallel, which can lead to temporary service outages that affect end users. To avoid taking down services on all the nodes at the same time, you can stagger highstates in a batch mode.

To apply a Salt Formula on a big number of nodes, for example, more than 100 nodes, follow one of the approaches below.

  • Use the --batch-size or -b flags to specify the number of nodes to have Salt apply a state in parallel:

    salt --batch-size <NUMBER_OF_NODES> '*' state.apply
    
  • Specify a percentage of nodes to apply a highstate on:

    salt -b <PERCENTAGE> '*' state.apply
    
  • Use node name conventions in the form of <GROUP>.<NODE_TYPE_NAME><NUM> to run a highstate by a pattern. For example: group1.cmp001:

    salt 'group1.cmp*' state.highstate
    
  • Use Node Groups that you can define in the Salt Master configuration file /etc/salt/master. To run a highstate on nodes within a Node Group, run:

    salt -N <NODE_GROUP> state.apply
    
  • Use Grains for grouping nodes specifying a grain variable in the /etc/salt/grains configuration file and then specify the grain value in the Salt command to apply a highstate for the nodes that have this grain value assigned:

    salt -G <GRAIN_NAME>:<GRAIN_VALUE> state.apply
    

Note

You can use the --batch-size flag together with Node Groups and Grains. For example:

salt --batch-size 10% -N computes1 state.apply
salt -b 5 -G 'compute:compute1' state.apply
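
The Node Group and Grain targeting above assumes that the corresponding definitions already exist. A minimal sketch of such definitions (the group name, node names, grain name, and grain value are illustrative):

# /etc/salt/master on the Salt Master node (restart salt-master after changing it)
nodegroups:
  computes1: 'L@cmp001.domain.local,cmp002.domain.local'

# /etc/salt/grains on each targeted minion
compute: compute1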

Replace the Salt Master keys

If your Salt Master keys have been compromised, you can replace both the Salt Master CA and the RSA SSH keys. The replacement of the Salt Master keys does not affect your cloud environment; only the Salt structure is updated.

Salt Master keys structure

  • /etc/salt/minion.d/_pki.conf

    PKI configuration file pointing to the current Salt Master CA certificate path

  • /etc/pki/ca/salt_master_ca/

    Catalog for the Salt Master CA certificate

  • /etc/pki/ca/salt_master_ca/ca.crt

    Salt Master CA certificate

  • /etc/pki/ca/salt_master_ca/ca.key

    Salt Master CA certificate key

  • /etc/pki/ca/salt_master_ca/certs

    Catalog for the Salt minion certificates signed by the Salt Master CA certificate

  • /etc/pki/ca/salt_master_ca/certs/XX:XX:XX:XX:XX:XX:XX:XX.crt

    Salt minion certificate signed by the Salt Master CA

  • /etc/salt/pki/minion/minion.pub, /etc/salt/pki/minion/minion.pem, /etc/salt/pki/minion/minion_master.pub

    Salt Master SSH RSA private and public keys for a Salt minion

  • /etc/salt/pki/master/master.pem, /etc/salt/pki/master/master.pub

    Salt Master SSH RSA private and public keys for the Salt Master

  • /etc/salt/pki/master/minions/ctl01.example.int

    RSA SSH minion key used for communication with the Salt Master. Equal to /etc/salt/pki/minion/minion.pub

Replace the Salt Master and Salt minions SSH RSA keys

This section provides instructions on how to replace the Salt Master and the Salt minions SSH RSA keys.

To replace Salt Master and Salt minions SSH RSA keys:

  1. Log in to the Salt Master node.

  2. Verify that all nodes are available:

    salt \* test.ping
    
  3. Create classes/cluster/<cluster-name>/infra/minions-maintenance.yml with the following content:

    parameters:
      _param:
        char_number_sign: "#"
      linux:
        system:
          file:
            restart-minion.sh:
              name: /usr/local/bin/restart-minion.sh
              user: root
              group: root
              mode: 750
              contents: |
                ${_param:char_number_sign}!/bin/bash
                /usr/sbin/service salt-minion stop
                rm -f /etc/salt/pki/minion/minion*;
                /usr/sbin/service salt-minion start
          job:
            restart-minion:
              enabled: True
              command: /usr/local/bin/restart-minion.sh
              user: root
              minute: '*/5'
    
  4. Include the minions-maintenance class in the infra/init.yml file:

    classes:
    ...
    - cluster.<cluster-name>.infra.minions-maintenance
    
  5. Put all Salt minions into the maintenance mode:

    salt \* state.sls linux.system.file,linux.system.job
    

    The command above causes all Salt minions to remove their keys and restart every 5 minutes.

  6. Count your minions:

    MINIONS_NUMBER=$(ls /etc/salt/pki/master/minions/ -1 | wc -l)
    
  7. Verify that all minions are put into the maintenance mode by checking that there is no diff between the /etc/salt/pki/master/minions/ and /etc/salt/pki/master/minions_denied/ directories:

    diff <(ls /etc/salt/pki/master/minions/ -1 | wc -l) \
    <(ls /etc/salt/pki/master/minions_denied/ -1 | wc -l)
    

    Start the verification at the beginning of the zero or fifth minute to have enough time to purge the old minion keys. Proceed only if the diff is empty. If you still see a diff after more than 10 minutes, some minions have failed to execute the cron job. Identify the root cause of the issue and resolve it before proceeding.

  8. Stop the Salt Master node:

    service salt-master stop
    
  9. Change directory to the Salt Master key:

    cd /etc/salt/pki/master
    
  10. Remove the Salt Master key:

    rm -f master.p*
    
  11. Generate a new key without a password:

    ssh-keygen -t rsa -b 4096 -f master.pem
    
  12. Remove the RSA public key for the new key as Salt Master does not require it:

    rm -f master.pem.pub
    
  13. Generate the .pem public key for the Salt Master node:

    openssl rsa -in master.pem -pubout -out master.pub
    

    Note

    Press Enter for the empty password.

  14. Remove the minions list on the Salt Master node:

    salt-key -y -d '*'
    
  15. Start the Salt Master node:

    service salt-master start
    
  16. Verify that the minions are present:

    Note

    The minions should register on the first or sixth minute.

    salt-key -L
    
  17. Verify that the current minions count is the same as in step 6:

    ls /etc/salt/pki/master/minions/ -1 | wc -l
    echo $MINIONS_NUMBER
    
  18. Disable the maintenance mode for minions by disabling the cron job in classes/cluster/<cluster-name>/infra/minions-maintenance.yml:

    job:
      restart-minion:
        enabled: False
    
  19. Update your minions:

    salt \* state.sls linux.system.job
    
  20. Remove the minions-maintenance class from the infra/init.yml file:

    classes:
    ...
    # Remove the following line
    - cluster.<cluster-name>.infra.minions-maintenance
    
  21. Remove the minions-maintenance pillar definition from the Reclass model:

    rm -f classes/cluster/<cluster-name>/infra/minions-maintenance.yml
    

Replace the Salt Master CA certificates

This section provides instructions on how to replace the Salt Master CA certificates.

To replace the Salt Master CA certificates:

  1. Log in to the Salt Master node.

  2. Back up the running Salt configuration in case a rollback is required:

    tar cf /root/salt-backup.tar /etc/salt /etc/pki/ca/salt_master_ca/
    gzip -9 /root/salt-backup.tar
    
  3. List all currently issued certificates.

    Currently, the index file for Salt Master CA does not exist. Therefore, you can list all certificates and find the latest ones using the salt_cert_list.py script:

    Note

    The script is available within Mirantis from the mcp-common-scripts GitHub repository.

    ./salt_cert_list.py
    

    Example of system response:

    /etc/pki/ca/salt_master_ca/certs/18:63:9E:A6:F3:7E:10:5F.crt (proxy, 10.20.30.10, horizon.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/EB:51:7C:DF:CE:E7:90:52.crt (10.20.30.10, 10.20.30.10, *.10.20.30.10)
    /etc/pki/ca/salt_master_ca/certs/15:DF:66:5C:8D:8B:CF:73.crt (internal_proxy, mdb01, mdb01.multinode-ha.int, 192.168.2.116, 192.168.2.115, 10.20.30.10)
    /etc/pki/ca/salt_master_ca/certs/04:30:B0:7E:76:98:5C:CC.crt (rabbitmq_server, msg01, msg01.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/26:16:E7:51:E4:44:B4:65.crt (mysql_server, 192.168.2.53, 192.168.2.50, dbs03, dbs03.multinode-ha.int)
    /etc/pki/ca/salt_master_ca/certs/78:26:2F:6E:2E:FD:6A:42.crt (internal_proxy, ctl02, 192.168.2.12, 10.20.30.10, 192.168.2.10)
    ...
    
  4. Update classes/cluster/<cluster_name>/infra/config.yml with the required values for the Salt Master CA. For example:

    parameters:
      _param:
        salt_minion_ca_country: us
        salt_minion_ca_locality: New York
        salt_minion_ca_organization: Planet Express
        salt_minion_ca_days_valid_authority: 3650
        salt_minion_ca_days_valid_certificate: 365
    
  5. Replace the Salt Master CA certificates:

    rm -f /etc/pki/ca/salt_master_ca/ca*
    salt-call state.sls salt.minion.ca -l debug
    
  6. Publish the Salt Master CA certificates as described in Publish CA certificates.

  7. Replace the certificates in your cloud environment according to the list of certificates obtained in step 3 of this procedure, as described in Manage certificates for the affected services.

DriveTrain operations

This section describes the main capabilities of DriveTrain, the MCP lifecycle management engine.

Job configuration history

The DriveTrain Jenkins provides the capability to inspect the history of jobs configuration changes using the Job Configuration History plugin. This plugin captures and stores the changes to all jobs configured in Jenkins and enables the DriveTrain administrator to view these changes. It allows identifying the date of the change, the user that created the change, and the content of the change itself.

To use the Job Configuration History plugin:

  1. Log in to the Jenkins web UI as an administrator using the FQDN of your cloud endpoint and port 8081. For example, https://cloud.example.com:8081.

  2. Navigate to Job Config History > Show job configs only.

  3. Click the required job and review the list of recorded changes.

Alternatively, you can access the history of a job by clicking Job Config History from the particular job view itself.

Abort a hung build in Jenkins

This section provides instructions on how to abort a hung Jenkins build if it does not recover after, for example, a restart of the dedicated Jenkins node.

To abort a hung build, select from the following options:

  • Abort the job build from the Jenkins web UI:

    1. Log in to the Jenkins web UI as an Administrator using the FQDN of your cloud endpoint and the 8081 port. For example, https://cloud.example.com:8081.

    2. Navigate to Manage Jenkins > Script Console.

    3. Run the following script setting the job name and number of the hung build accordingly:

      def build = Jenkins.instance.getItemByFullName("jobName").getBuildByNumber(jobNumber)
      build.doStop()
      build.doKill()
      
  • Abort the job build from a cid node (if the previous option did not help):

    1. Log in to any cid node.

    2. Run:

      cd /srv/volumes/jenkins/jobs/<job-name>/builds/
      rm -rf <hung-build-number>
      
    3. Log in to the Jenkins web UI as an Administrator using the FQDN of your cloud endpoint and the 8081 port. For example, https://cloud.example.com:8081.

    4. Navigate to Manage Jenkins > Reload Configuration from Disk and confirm the reload. Alternatively, restart the Jenkins instance.

Enable Jenkins audit logging

This section instructs you on how to enable audit logging in Jenkins using the Audit Trail Jenkins plugin. The plugin keeps a log of the users who performed particular Jenkins operations, such as managing and using jobs.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Note

If Jenkins is disabled on the Salt Master node, skip step 4 of the procedure below.

To set up audit logging in Jenkins:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file, configure any of the three logger types: console, file, and syslog.

    Note

    By default, only the console output is collected by Fluentd if enabled.

    Pillar examples:

    1. For the console logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                console_logger:
                  type: console
                  output: STD_OUT
                  date_format: "yyyy-MM-dd HH:mm:ss:SSS"
                  log_prefix: ""
      

      Note

      The date_format and log_prefix parameters in the example above are defaults and can be skipped.

    2. For the file logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                file_logger:
                  type: file
                  log: /var/jenkins_home/file_logger.log
                  limit: 100
                  count: 10
      

      Note

      The limit parameter stands for the file size limit in MB. The count parameter stands for the number of files to keep.

    3. For the syslog logger:

      parameters:
        jenkins:
          client:
            audittrail:
              loggers:
                syslog_logger:
                  type: syslog
                  syslog_server_hostname: 'syslog.host.org'
                  syslog_server_port: 514
                  syslog_facility: SYSLOG
                  app_name: jenkins
                  message_hostname: ""
                  message_format: RFC_3164
      
  4. To configure audit logging for Jenkins on the Salt Master node, add similar pillars to infra/config/jenkins.yml.

  5. Refresh pillars:

    salt -C 'I@jenkins:client' saltutil.refresh_pillar
    
  6. Apply the changes:

    salt -C 'I@jenkins:client:audittrail' state.apply jenkins.client.audittrail
    

Jenkins Matrix-based security authorization

The DriveTrain Jenkins uses Matrix-based security authorization by default. It allows you to grant specific permissions to users and groups. Jenkins uses the DriveTrain OpenLDAP server as an identity provider and authentication server.

By default, the Jenkins server includes the following user groups:

  • admins

    Contains administrative users with the Jenkins Administer permission.

  • Authenticated Users

    Includes all users authenticated through the DriveTrain OpenLDAP server. This group has no permissions configured by default.

The Matrix-based security plugin enables the operator to configure the following types of permissions:

  • Overall

    Either Administer or Read permissions can be set overall.

  • Credentials

    Permissions to create, delete, update, and view authentication credentials, and manage domains.

  • Gerrit

    Permissions to manually trigger and retrigger Gerrit integration plugin to run specific jobs normally initiated by the plugin.

  • Agents

    Permissions to manage Jenkins agents on worker nodes.

  • Job

    Permissions for specific operations on Jenkins jobs, including build creation, configuration, and execution.

  • Run

    Permissions to run and rerun jobs.

  • View

    Permissions to manage views in the Jenkins UI.

  • SCM

    Permissions to use SCM tags.

  • Metrics

    Permissions to view and configure metrics.

  • Lockable resources

    Permissions to reserve and unlock lockable resources manually.

  • Artifactory

    Permissions to use the Artifactory integration plugin (only if Artifactory is installed).

To configure the Matrix-based Security Authorization:

  1. Log in to the Jenkins web UI as an administrator using the FQDN of your cloud endpoint and port 8081. For example, https://cloud.example.com:8081.

  2. Navigate to Manage Jenkins > Configure Global Security.

  3. Scroll to the Authorization section to view and change the Matrix-based security settings.

Remove executors on Jenkins master

The DriveTrain Jenkins enabled on the Salt Master node allows you to run any job on any Jenkins slave or master. However, running a job on the Jenkins master can lead to job failures since the Jenkins master is not intended to be a job executor.

Starting from the MCP 2019.2.4 maintenance update, MCP disables executors on Jenkins master by default to prevent failures of the jobs that run on the Salt Master node. For the MCP versions earlier than 2019.2.4, Mirantis recommends setting zero executors on Jenkins master to disable jobs scheduling on this agent node as described below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), you can skip the steps below or simply update your cluster configuration without applying the Salt states.

To set zero executors on Jenkins master on the Salt Master node:

  1. Log in to the Salt Master node.

  2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml, add the following pillar data:

    parameters:
      jenkins:
        client:
          ...
          node:
            master:
              num_executors: 0
          ...
    
  3. Refresh pillars:

    salt -C 'I@salt:master' saltutil.refresh_pillar
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.apply jenkins.client
    

Remove anonymous access to Jenkins on the Salt Master node

The DriveTrain Jenkins enabled on the Salt Master node is configured to allow anonymous users to access the Jenkins web UI, including listing Jenkins jobs and builds.

For security reasons, starting from the MCP 2019.2.4 maintenance update, by default, only authorized users have access to Jenkins on the Salt Master node. For the MCP versions earlier than 2019.2.4, Mirantis recommends configuring Jenkins as described below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), you can skip the steps below or simply update your cluster configuration without applying the Salt states.

To remove anonymous access to Jenkins on the Salt Master node:

  1. Log in to the Salt Master node.

  2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml, replace anonymous with authenticated for jenkins_security_matrix_read:

    parameters:
      _param:
        jenkins_security_matrix_read:
        - authenticated
    
  3. Refresh pillars:

    salt -C 'I@salt:master' saltutil.refresh_pillar
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.apply jenkins.client
    

Use SSH Jenkins slaves

By default, Jenkins uses Java Network Launch Protocol (JNLP) for Jenkins slave connection. Starting from the MCP 2019.2.5 maintenance update, you can set up SSH connection for Jenkins slaves instead of JNLP using the steps below.

Note

If Jenkins is disabled on the Salt Master node (for details, refer to MCP Deployment Guide: Deploy CI/CD), skip steps 2 and 3 of the procedure below.

To use SSH connection instead of JNLP for Jenkins slaves:

  1. Log in to the Salt Master node.

  2. Configure Jenkins Master for the Salt Master node to use SSH Jenkins slaves:

    1. Verify that SSH keys for the Jenkins admin user already exist:

      salt-call pillar.get _param:jenkins_admin_public_key_generated
      salt-call pillar.get _param:jenkins_admin_private_key_generated
      

      The system output must not be empty.

      If you do not have SSH keys, generate new ones:

      ssh-keygen
      
    2. In ./classes/cluster/<cluster_name>/infra/config/jenkins.yml:

      1. Replace the system.docker.swarm.stack.jenkins.slave_single or system.docker.swarm.stack.jenkins.jnlp_slave_single class (whichever is present in the model) with the following class:

        classes:
        ...
        - system.docker.swarm.stack.jenkins.ssh_slave_single
        
      2. Remove the following classes if present:

        classes:
        ...
        - system.docker.client.images.jenkins_master
        - system.docker.client.images.jenkins_slave
        
      3. Change the Jenkins slave type to ssh instead of jnlp:

        parameters:
          ...
          jenkins:
            client:
              node:
                slave01:
                  launcher:
                    type: ssh
        
      4. Add the SSH keys parameters to the parameters section:

        • If you use existing SSH keys:

          parameters:
            _param:
              ...
              jenkins_admin_public_key: ${_param:jenkins_admin_public_key_generated}
              jenkins_admin_private_key: ${_param:jenkins_admin_private_key_generated}
              ...
          
        • If you generated new SSH keys in step 2.1:

          parameters:
            _param:
              ...
              jenkins_admin_public_key: <ssh-public-key>
              jenkins_admin_private_key: <ssh-private-key>
              ...
          
  3. Remove the JNLP slave from Jenkins on Salt Master node:

    1. Log in to the Jenkins web UI on the Salt Master node.

    2. Navigate to Manage Jenkins > Manage nodes.

    3. Select slave01 > Delete agent. Click yes to confirm.


  4. Configure Jenkins Master for the cid nodes to use SSH Jenkins slaves:

    1. Verify that the Jenkins SSH key is defined in the Reclass model:

      salt 'cid01*' pillar.get _param:jenkins_admin_public_key
      salt 'cid01*' pillar.get _param:jenkins_admin_private_key
      
    2. In ./classes/cluster/<cluster_name>/cicd/control/leader.yml:

      1. Replace the system.docker.swarm.stack.jenkins class with system.docker.swarm.stack.jenkins.master and the system.docker.swarm.stack.jenkins.jnlp_slave_multi class with system.docker.swarm.stack.jenkins.ssh_slave_multi if present, or add system.docker.swarm.stack.jenkins.ssh_slave_multi explicitly.

      2. Add the system.jenkins.client.ssh_node class right below the system.jenkins.client.node class:

        classes:
        ...
        - system.jenkins.client.node
        - system.jenkins.client.ssh_node
        
  5. Remove the JNLP slaves from Jenkins on the cid nodes:

    1. Log in to the cid Jenkins web UI.

    2. Navigate to Manage Jenkins > Manage nodes.

    3. Delete slave01, slave02 and slave03 using the menu. For example: slave01 > Delete agent. Click yes to confirm.


  6. Refresh pillars:

    salt -C 'I@jenkins:client' saltutil.refresh_pillar
    salt -C 'I@docker:client' saltutil.refresh_pillar
    
  7. Pull the ssh-slave Docker image:

    salt -C 'I@docker:client:images' state.apply docker.client.images
    
  8. Apply the changes:

    salt -C 'I@jenkins:client and I@docker:client' state.apply docker.client
    salt -C 'I@jenkins:client' state.apply jenkins.client
    

Enable HTTPS access from Jenkins to Gerrit

By default, Jenkins uses the SSH connection to access Gerrit repositories. This section explains how to set up the HTTPS connection from Jenkins to Gerrit repositories.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To enable access from Jenkins to Gerrit through HTTPS:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file:

    1. Replace the system.jenkins.client.credential.gerrit class with the system.jenkins.client.credential.gerrit_http class:

      classes:
      ...
      - system.jenkins.client.credential.gerrit_http
      
    2. Redefine the jenkins_gerrit_url parameter as follows:

      jenkins_gerrit_url: "https://${_param:haproxy_gerrit_bind_host}:${_param:haproxy_gerrit_bind_port}"
      
  4. Refresh pillars:

    salt -C 'I@jenkins:client and not I@salt:master' saltutil.refresh_pillar
    
  5. Apply the changes:

    salt -C 'I@jenkins:client and not I@salt:master' state.apply jenkins.client
    

Manage secrets in the Reclass model

MCP uses GPG encryption to protect sensitive data in the Git repositories of the Reclass model. The private key used to decrypt the data is stored on the Salt Master node and is available to the root user only. Usually, the encrypted data is stored in the secrets.yml files located in the /srv/salt/reclass/cluster directory. The decryption key is located in a keyring in /etc/salt/gpgkeys.

Note

MCP uses the secrets file name for organizing sensitive data management. If required, you can encrypt data in other files, as well as use unencrypted data in the secrets.yml files.

The secrets encryption feature is not enabled by default. To enable the feature, define secrets_encryption_enabled: 'True' in the Cookiecutter context before the deployment. See MCP Deployment Guide: Infrastructure related parameters: Salt Master for the details.

To change a password:

  1. Get the ID of the private key in question:

    # GNUPGHOME=/etc/salt/gpgkeys gpg --list-secret-keys
    

    The machine-readable version of the above command:

    # GNUPGHOME=/etc/salt/gpgkeys gpg --list-secret-keys --with-colons | awk -F: -e '/^sec/{print $5}'
    
  2. Encrypt the new password:

    # echo -ne <new_password> | GNUPGHOME=/etc/salt/gpgkeys gpg --encrypt --always-trust -a -r <key_id>
    
  3. Add the new password to secrets.yml.
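
    For illustration, the ASCII-armored output of the previous command is stored as a block value in secrets.yml; a hypothetical sketch (the parameter name is illustrative):

    parameters:
      _param:
        mysql_admin_password: |
          -----BEGIN PGP MESSAGE-----
          ...
          -----END PGP MESSAGE-----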

To decrypt the data:

To get the decoded value, pass the encrypted value to the command:

# GNUPGHOME=/etc/salt/gpgkeys gpg --decrypt

To change the secret encryption private key:

  1. Add a new key to keyring in /etc/salt/gpgkeys using one of the following options:

    • Import the existing key:

      # GNUPGHOME=/etc/salt/gpgkeys gpg --import < <key_file>
      
    • Create a new key:

      # GNUPGHOME=/etc/salt/gpgkeys gpg --gen-key
      
  2. Replace all encrypted fields in all secrets.yml files with the encrypted value for new key_id.

Configure allowed and rejected IP addresses for the GlusterFS volumes

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section provides instructions on how to configure the list of allowed and rejected IP addresses for the GlusterFS volumes.

By default, MCP restricts access to all preconfigured GlusterFS volumes to the control network.

To configure the GlusterFS authentication:

  1. Log in to the Salt Master node.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. In the infra/glusterfs.yml file, configure the GlusterFS authentication depending on the needs of your MCP deployment:

    • To adjust the list of allowed and rejected IP addresses on all preconfigured GlusterFS volumes, define the glusterfs_allow_ips and glusterfs_reject_ips parameters as required:

      parameters:
        _param:
          glusterfs_allow_ips: <comma-separated list of IPs>
          glusterfs_reject_ips: <comma-separated list of IPs>
      

      Note

      You can use the * wildcard to specify IP ranges.

      Configuration example:

      parameters:
        _param:
          glusterfs_allow_ips: 10.0.0.1, 192.168.1.*
          glusterfs_reject_ips: 192.168.1.201
      

      The configuration above allows the access to all GlusterFS volumes from 10.0.0.1 and all IP addresses in the 192.168.1.0/24 network except for 192.168.1.201.

    • To change allowed and rejected IP addresses for a single volume:

      parameters:
        glusterfs:
          server:
            volumes:
              <volume_name>:
                options:
                  auth.allow: <comma-separated_list_of_IPs>
                  auth.reject: <comma-separated_list_of_IPs>
      
    • To define the same access-control lists (ACL) as for all preconfigured GlusterFS volumes to a custom GlusterFS volume, define the auth.allow and auth.reject options for the targeted volume as follows:

      auth.allow: ${_param:glusterfs_allow_ips}
      auth.reject: ${_param:glusterfs_reject_ips}
      
  4. Apply the changes:

    salt -I 'glusterfs:server:role:primary' state.apply glusterfs
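
To verify the resulting access-control options of a volume, you can inspect the volume information, for example (the volume name is illustrative):

salt -I 'glusterfs:server:role:primary' cmd.run 'gluster volume info <volume_name>'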
    

Manage users in OpenLDAP

DriveTrain uses OpenLDAP to provide authentication and metadata for MCP users. This section describes how to create a new user entry in the OpenLDAP service through the Reclass cluster metadata model and grant the user permissions to access Gerrit and Jenkins.

To add a user to an OpenLDAP server:

  1. Log in to the Salt Master node.

  2. Check out the latest version of the Reclass cluster metadata model from the Git repository for your project.

  3. Create a new directory called people in classes/cluster/<CLUSTER_NAME>/cicd/:

    mkdir classes/cluster/<cluster_name>/cicd/people
    

    New user definitions will be added to this directory.

  4. Create a new YAML file in the people directory for a new user. For example, joey.yml:

    touch classes/cluster/<cluster_name>/cicd/people/joey.yml
    
  5. In the newly created file, add the user definition. For example:

    parameters:
      _param:
        openldap_pw_joey: "<ENCRYPTED_PASSWORD>"
      openldap:
        client:
          entry:
            people:
              entry:
                joey:
                  attr:
                    uid: joey
                    userPassword: ${_param:openldap_pw_joey}
                    uidNumber: 20600
                    gidNumber: 20001
                    gecos: "Joey Tribbiani"
                    givenName: Joey
                    sn: Tribbiani
                    homeDirectory: /home/joey
                    loginShell: /bin/bash
                    mail: joey@domain.tld
                  classes:
                    - posixAccount
                    - inetOrgPerson
                    - top
                    - shadowAccount
    

    Parameters description:

    • openldap_pw_joey

      The user password for the joey user that can be created using the following example command:

      echo "{CRYPT}$(mkpasswd --rounds 500000 -m sha-512 \
        --salt `head -c 40 /dev/random | base64 | sed -e 's/+/./g' \
        |  cut -b 10-25` 'r00tme')"
      

      Substitute r00tme with the user password to be encrypted.

    • uid

      The case-sensitive user ID to be used as a login ID for Gerrit, Jenkins, and other integrated services.

    • userPassword: ${_param:openldap_pw_joey}

      The password for the joey user, same as the openldap_pw_joey value.

    • gidNumber

      An integer uniquely identifying a group in an administrative domain to which the user should belong.

    • uidNumber

      An integer uniquely identifying a user in an administrative domain.

  6. Add the new user definition from joey.yml as a class in classes/cluster/<CLUSTER_NAME>/cicd/control/leader.yml:

    classes:
      ...
      - cluster.<CLUSTER_NAME>.cicd.control
      - cluster.<CLUSTER_NAME>.cicd.people.joey
    

    By defining the cluster level parameters of the joey user and including it in the classes section of cluster/<CLUSTER_NAME>/cicd/control/leader.yml, you import the user data to the cid01 node inventory, although the parameter has not been rendered just yet.

  7. Commit the change.

  8. Update the copy of the model on the Salt Master node:

    sudo git -C /srv/salt/reclass pull
    
  9. Synchronize all Salt resources:

    sudo salt '*' saltutil.sync_all
    
  10. Apply the changes:

    sudo salt 'cid01*' state.apply openldap
    

    Example output for a successfully created user:

            ID: openldap_client_cn=joey,ou=people,dc=deploy-name,dc=local
      Function: ldap.managed
        Result: True
       Comment: Successfully updated LDAP entries
       Started: 18:12:29.788665
      Duration: 58.193 ms
       Changes:
                ----------
                cn=joey,ou=people,dc=deploy-name,dc=local:
                    ----------
                    new:
                        ----------
                        cn:
                            - joey
                        gecos:
                            - Joey Tribbiani
                        gidNumber:
                            - 20001
                        givenName:
                            - Joey
                        homeDirectory:
                            - /home/joey
                        loginShell:
                            - /bin/bash
                        mail:
                            - joey@domain.tld
                        objectClass:
                            - inetOrgPerson
                            - posixAccount
                            - shadowAccount
                            - top
                        sn:
                            - Tribbiani
                        uid:
                            - joey
                        uidNumber:
                            - 20600
                        userPassword:
                            - {CRYPT}$6$rounds=500000$KaJBYb3F8hYMv.UEHvc0...
                    old:
                        None
    
    Summary for cid01.domain.tld
    ------------
    Succeeded: 7 (changed=1)
    Failed:    0
    ------------
    Total states run:     7
    Total run time: 523.672 ms
    

Disable LDAP authentication on host OS

This section describes how to disable LDAP authentication on a host operating system.

To disable LDAP authentication:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster_name>/infra/auth/ldap.yml, disable LDAP:

    ldap:
      enabled: false
    
  3. Enforce the linux.system update:

    salt '<target_node>*' state.sls linux.system
    
  4. Clean up nodes:

    salt '<target_node>*' cmd.run 'export DEBIAN_FRONTEND=noninteractive; apt purge -y libnss-ldapd libpam-ldapd; sed -i "s/ ldap//g" /etc/nsswitch.conf'
    

Enable Gerrit audit logging

This section instructs you on how to enable audit logging in Gerrit by configuring the httpd requests logger. Fluentd collects the files with the error logs automatically.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To set up audit logging in Gerrit:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In the cicd/control/leader.yml file, add the following parameters:

    parameters:
      _param:
        ...
        gerrit_extra_opts: "-Dlog4j.configuration=file:///var/gerrit/review_site/etc/log4j.properties"
        gerrit_http_request_log: 'True'
        ...
      linux:
        system:
          file:
            "/srv/volumes/gerrit/etc/log4j.properties":
              contents:
                - log4j.logger.httpd_log=INFO,httpd_log
                - log4j.appender.httpd_log=org.apache.log4j.ConsoleAppender
                - log4j.appender.httpd_log.layout=com.google.gerrit.pgm.http.jetty.HttpLogLayout
    
  4. Refresh pillars:

    salt -C 'I@gerrit:client' saltutil.refresh_pillar
    
  5. Create the log4j.properties file:

    salt -C 'I@gerrit:client' state.apply linux.system.file
    
  6. Update the Gerrit service:

    salt -C 'I@gerrit:client' state.apply docker.client
    

Configure log rotation using logrotate

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure log rotation for selected services using the logrotate utility.

The services that support the log rotation configuration include:

  • OpenStack services

    Aodh, Barbican, Ceilometer, Cinder, Designate, Glance, Gnocchi, Heat, Keystone, Neutron, Nova, Octavia

  • Other services

    atop, Backupninja, Ceph, Elasticsearch, Galera (MySQL), GlusterFS, HAProxy, libvirt, MAAS, MongoDB, NGINX, Open vSwitch, PostgreSQL, RabbitMQ, Redis, Salt, Telegraf

MCP supports configuration of the rotation interval and the number of rotations. Configuration of other logrotate options, such as postrotate and prerotate actions, is not supported.

To configure log rotation:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. Configure the interval and rotate parameters for the target service as required:

    • logrotate:interval

      Defines the rotation time interval. The available values are daily, weekly, monthly, and yearly.

    • logrotate:rotate

      Defines the number of rotated logs to keep. The parameter expects an integer value.

    Use the Logrotate configuration table below to determine where to add the log rotation configuration.

    Logrotate configuration

    Service                   Pillar path (target)                  File path
    Aodh                      aodh:server                           openstack/telemetry.yml
    atop                      linux:system:atop                     The root file of the component [7]
    Backupninja               backupninja:client                    infra/backup/client_common.yml
    Barbican                  barbican:server                       openstack/barbican.yml
    Ceilometer server [0]     ceilometer:server                     openstack/telemetry.yml
    Ceilometer agent [0]      ceilometer:agent                      openstack/compute/init.yml
    Ceph                      ceph:common                           ceph/common.yml
    Cinder controller [1]     cinder:controller                     openstack/control.yml
    Cinder volume [1]         cinder:volume                         openstack/control.yml
    Designate                 designate:server                      openstack/control.yml
    Elasticsearch server      elasticsearch:server                  stacklight/log.yml
    Elasticsearch client      elasticsearch:client                  stacklight/log.yml
    Galera (MySQL) master     galera:master                         openstack/database/master.yml
    Galera (MySQL) slave      galera:slave                          openstack/database/slave.yml
    Glance                    glance:server                         openstack/control.yml
    GlusterFS server [2]      glusterfs:server                      The root file of the component [7]
    GlusterFS client [2]      glusterfs:client                      The root file of the component [7]
    Gnocchi server            gnocchi:server                        openstack/telemetry.yml
    Gnocchi client            gnocchi:client                        openstack/control/init.yml
    HAProxy                   haproxy:proxy                         openstack/proxy.yml
    Heat                      heat:server                           openstack/control.yml
    Keystone server           keystone:server                       openstack/control.yml
    Keystone client           keystone:client                       openstack/control/init.yml
    libvirt                   nova:compute:libvirt [3]              openstack/compute/init.yml
    MAAS                      maas:region                           infra/maas.yml
    MongoDB                   mongodb:server                        stacklight/server.yml
    Neutron server            neutron:server                        openstack/control.yml
    Neutron client            neutron:client                        openstack/control/init.yml
    Neutron gateway           neutron:gateway                       openstack/gateway.yml
    Neutron compute           neutron:compute                       openstack/compute/init.yml
    NGINX                     nginx:server                          openstack/proxy.yml, stacklight/proxy.yml
    Nova controller           nova:controller                       openstack/control.yml
    Nova compute              nova:compute                          openstack/compute/init.yml
    Octavia manager [4]       octavia:manager                       openstack/octavia_manager.yml
    Octavia client [4]        octavia:client                        openstack/control.yml
    Open vSwitch              linux:network:openvswitch             infra/init.yml
    PostgreSQL server         postgresql:server (maas:region) [5]   infra/config/postgresql.yml (infra/maas.yml)
    PostgreSQL client         postgresql:client (maas:region) [5]   infra/config/postgresql.yml (infra/maas.yml)
    RabbitMQ                  rabbitmq:server                       openstack/message_queue.yml
    Redis                     redis:server                          openstack/telemetry.yml
    Salt master [6]           salt:master                           infra/config/init.yml
    Salt minion [6]           salt:minion                           The root file of the component [7]
    Telegraf                  telegraf:agent                        infra/init.yml, stacklight/server.yml

    [0] If Ceilometer server and agent are specified on the same node, the server configuration is prioritized.
    [1] If Cinder controller and volume are specified on the same node, the controller configuration is prioritized.
    [2] If GlusterFS server and client are specified on the same node, the server configuration is prioritized.
    [3] Use nova:compute:libvirt as the pillar path, but only nova:compute as the target.
    [4] If Octavia manager and client are specified on the same node, the manager configuration is prioritized.
    [5] PostgreSQL is a dependency of MAAS. Configure PostgreSQL from the MAAS pillar only if the service has been installed as a dependency without the postgresql pillar defined. If the postgresql pillar is defined, configure it instead.
    [6] If the Salt Master and minion are specified on the same node, the master configuration is prioritized.
    [7] Depending on the nodes where you want to change the configuration, select their component's root file. For example, infra/init.yml, openstack/control/init.yml, cicd/init.yml, and so on.

    For example, to set log rotation for Aodh to keep logs for the last 4 weeks with the daily rotation interval, add the following configuration to cluster/<cluster_name>/openstack/telemetry.yml:

    parameters:
      aodh:
        server:
          logrotate:
            interval: daily
            rotate: 28
    
  4. Apply the logrotate state on the node with the target service:

    salt -C 'I@<target>' saltutil.sync_all
    salt -C 'I@<target>' state.sls logrotate
    

    For example:

    salt -C 'I@aodh:server' state.sls logrotate
    

Configure remote logging for auditd

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure remote logging for auditd.

To configure remote logging for auditd:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<cluster_name>/ directory, open one of the following files:

    • To configure one remote host for auditd for all nodes, use infra/init.yml.

    • To configure a remote host for a set of nodes, use a specific configuration file. For example, openstack/compute/init.yml for all OpenStack compute nodes.

  3. Configure the remote host using the following exemplary pillar:

    parameters:
      audisp:
        enabled: true
        remote:
          remote_server: <ip_address or hostname>
          port: <port>
          local_port: any
          transport: tcp
          ...
          key1: value1
    
  4. Refresh pillars on the target nodes:

    salt <nodes> saltutil.refresh_pillar
    
  5. Apply the auditd.audisp state on the target nodes:

    salt <nodes> state.apply auditd.audisp
    

Configure memory limits for the Redis server

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to configure the memory rules and limits for the Redis server.

To configure memory limits for the Redis server:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/openstack/telemetry.yml, specify the following parameters as required:

    parameters:
      redis:
        server:
          maxmemory: 1073741824 # 1GB
          maxmemory-policy: <memory-policy>
          maxmemory-samples: 3
    

    Supported values for the maxmemory-policy parameter include:

    • volatile-lru - the service removes the key with expiration time set using the Least Recently Used (LRU) algorithm

    • allkeys-lru - the service removes any key according to the LRU algorithm

    • volatile-random - the service removes a random key with an expiration time set

    • allkeys-random - the service removes any random key

    • volatile-ttl - the service removes the key with the nearest expiration time (minor TTL)

    • noeviction - the service does not remove any key but returns an error on write operations
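
    For example, a variant of the pillar above (a minimal sketch) that evicts the least recently used keys across all keys once the 1 GB limit is reached:

    parameters:
      redis:
        server:
          maxmemory: 1073741824  # 1 GB
          maxmemory-policy: allkeys-lru
          maxmemory-samples: 3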

  3. Apply the changes:

    salt -C 'I@redis:server' saltutil.refresh_pillar
    salt -C 'I@redis:server' state.apply redis.server
    

Configure multiple NTP servers

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

MCP enables you to configure multiple Network Time Protocol (NTP) servers on new or existing MCP clusters to provide a more flexible and wide NTP support for clustered applications such as Ceph, Galera, and others.

For new MCP clusters, configure multiple NTP servers during the deployment model creation using the ntp_servers parameter passed to Cookiecutter in the following format:

server1.ntp.org,server2.ntp.org,server3.ntp.org

For details, see Networking deployment parameters in MCP Deployment Guide: Create a deployment metadata model.

For existing MCP clusters, configure multiple NTP servers by updating the NTP configuration for MAAS and all nodes of an MCP cluster.

To configure multiple NTP servers for MAAS:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In infra/maas.yml, update the MAAS pillars using the example below:

    parameters:
      maas:
        region:
          ...
          ntp:
            server1:
              enabled: True
              host: ntp.example1.org
            server2:
              enabled: True
              host: ntp.example2.org
    
  4. Update the MAAS configuration:

    salt-call saltutil.refresh_pillar
    salt-call state.apply maas.region
    

To configure multiple NTP servers for all nodes of an MCP cluster:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In infra/init.yml, update the NTP client pillar using the example below:

    parameters:
      ntp:
        client:
          enabled: true
          stratum:
            primary:
              server: primary.ntp.org
            secondary:  # if exists
              server: secondary.ntp.org
            srv_3:  # if exists
              server: srv_3.ntp.org
    
  4. Update the NTP configuration:

    salt '*' saltutil.refresh_pillar
    salt '*' state.apply ntp.client
    

OpenStack operations

This section includes all OpenStack-related Day-2 operations such as reprovisioning of OpenStack controller and compute nodes, preparing the Ironic service to provision cloud workloads on bare metal nodes, and others.

Manage Virtualized Control Plane

This section describes operations with the MCP Virtualized Control Plane (VCP).

Add a controller node

If you need to expand the size of the VCP to handle a bigger data plane, you can add more controller nodes to your cloud environment. This section instructs you on how to add a KVM node and an OpenStack controller VM to an existing environment.

The same procedure can be applied for scaling the messaging, database, and any other services.

Additional parameters will have to be added before the deployment.

To add a controller node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Log in to the Salt Master node.

  3. In the /classes/cluster/<cluster_name>/infra/init.yml file, define the basic parameters for the new KVM node:

    parameters:
      _param:
        infra_kvm_node04_address: <IP ADDRESS ON CONTROL NETWORK>
        infra_kvm_node04_deploy_address: <IP ADDRESS ON DEPLOY NETWORK>
        infra_kvm_node04_storage_address: ${_param:infra_kvm_node04_address}
        infra_kvm_node04_public_address: ${_param:infra_kvm_node04_address}
        infra_kvm_node04_hostname: kvm<NUM>
        glusterfs_node04_address: ${_param:infra_kvm_node04_address}
      linux:
        network:
          host:
            kvm04:
              address: ${_param:infra_kvm_node04_address}
              names:
              - ${_param:infra_kvm_node04_hostname}
              - ${_param:infra_kvm_node04_hostname}.${_param:cluster_domain}
    
  4. In the /classes/cluster/<cluster_name>/openstack/init.yml file, define the basic parameters for the new OpenStack controller node.

    openstack_control_node<NUM>_address: <IP_ADDRESS_ON_CONTROL_NETWORK>
    openstack_control_node<NUM>_hostname: <HOSTNAME>
    openstack_database_node<NUM>_address: <DB_IP_ADDRESS>
    openstack_database_node<NUM>_hostname: <DB_HOSTNAME>
    openstack_message_queue_node<NUM>_address: <IP_ADDRESS_OF_MESSAGE_QUEUE>
    openstack_message_queue_node<NUM>_hostname: <HOSTNAME_OF_MESSAGE_QUEUE>
    

    Example of configuration:

    kvm04_control_ip: 10.167.4.244
    kvm04_deploy_ip: 10.167.5.244
    kvm04_name: kvm04
    openstack_control_node04_address: 10.167.4.14
    openstack_control_node04_hostname: ctl04
    
  5. In the /classes/cluster/<cluster_name>/infra/config.yml file, define the configuration parameters for the KVM and OpenStack controller nodes. For example:

    reclass:
      storage:
        node:
          infra_kvm_node04:
            name: ${_param:infra_kvm_node04_hostname}
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.infra.kvm
            params:
              keepalived_vip_priority: 103
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: ${_param:infra_kvm_node04_address}
              deploy_address: ${_param:infra_kvm_node04_deploy_address}
              public_address: ${_param:infra_kvm_node04_public_address}
              storage_address: ${_param:infra_kvm_node04_storage_address}
          openstack_control_node04:
            name: ${_param:openstack_control_node04_hostname}
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.control
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: ${_param:openstack_control_node04_address}
              keepalived_vip_priority: 104
              opencontrail_database_id: 4
              rabbitmq_cluster_role: slave
    
  6. In the /classes/cluster/<cluster_name>/infra/kvm.yml file, define a new brick for GlusterFS on all KVM nodes and a salt:control definition that later spawns the OpenStack controller node. For example:

    _param:
      cluster_node04_address: ${_param:infra_kvm_node04_address}
    glusterfs:
      server:
        volumes:
          glance:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/glance
          keystone-keys:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/keystone-keys
          keystone-credential-keys:
            replica: 4
            bricks:
              - ${_param:cluster_node04_address}:/srv/glusterfs/keystone-credential-keys
    salt:
      control:
        cluster:
          internal:
            domain: ${_param:cluster_domain}
            engine: virt
            node:
              ctl04:
                name: ${_param:openstack_control_node04_hostname}
                provider: ${_param:infra_kvm_node04_hostname}.${_param:cluster_domain}
                image: ${_param:salt_control_xenial_image}
                size: openstack.control
    
  7. In the /classes/cluster/<cluster_name>/openstack/control.yml file, add the OpenStack controller node into existing services such as HAProxy, and others, depending on your environment configuration.

    Example of adding an HAProxy host for Glance:

    _param:
      cluster_node04_hostname: ${_param:openstack_control_node04_hostname}
      cluster_node04_address: ${_param:openstack_control_node04_address}
    haproxy:
      proxy:
        listen:
          glance_api:
            servers:
            - name: ${_param:cluster_node04_hostname}
              host: ${_param:cluster_node04_address}
              port: 9292
              params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3
          glance_registry_api:
            servers:
            - name: ${_param:cluster_node04_hostname}
              host: ${_param:cluster_node04_address}
              port: 9191
              params: check
    
  8. Refresh the deployed pillar data by applying the reclass.storage state:

    salt '*cfg*' state.sls reclass.storage
    
  9. Verify that the target node has connectivity with the Salt Master node:

    salt '*kvm<NUM>*' test.ping
    
  10. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  11. On the Salt Master node, apply the Salt linux state for the added node:

    salt -C 'I@salt:control' state.sls linux
    
  12. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  13. Perform the initial Salt configuration:

    salt -C 'I@salt:control' state.sls salt.minion
    
  14. Set up the network interfaces and the SSH access:

    salt -C 'I@salt:control' state.sls linux.system.user,openssh,linux.network,ntp
    
  15. Reboot the KVM node:

    salt '*kvm<NUM>*' cmd.run 'reboot'
    
  16. On the Salt Master node, apply the libvirt state:

    salt -C 'I@salt:control' state.sls libvirt
    
  17. On the Salt Master node, create a controller VM for the added physical node:

    salt -C 'I@salt:control' state.sls salt.control
    

    Note

    Salt virt takes the name of a virtual machine and registers the virtual machine on the Salt Master node.

    Once created, the instance picks up an IP address from the MAAS DHCP service and the key will be seen as accepted on the Salt Master node.

  18. Verify that the controller VM has connectivity with the Salt Master node:

    salt 'ctl<NUM>*' test.ping
    
  19. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  20. Apply the Salt highstate for the controller VM:

    salt -C 'I@salt:control' state.highstate
    
  21. Verify that the added controller node is registered on the Salt Master node:

    salt-key
    
  22. To reconfigure VCP VMs, run the openstack-deploy Jenkins pipeline with all necessary install parameters as described in MCP Deployment Guide: Deploy an OpenStack environment.

Replace a KVM node

If a KVM node hosting the Virtualized Control Plane has failed and recovery is not possible, you can recreate the KVM node from scratch with all VCP VMs that were hosted on the old KVM node. The replaced KVM node will be assigned the same IP addresses as the failed KVM node.

Replace a failed KVM node

This section describes how to recreate a failed KVM node with all VCP VMs that were hosted on the old KVM node. The replaced KVM node will be assigned the same IP addresses as the failed KVM node.

To replace a failed KVM node:

  1. Log in to the Salt Master node.

  2. Copy and keep the hostname and GlusterFS UUID of the old KVM node.

    To obtain the UUIDs of all peers in the cluster:

    salt '*kvm<NUM>*' cmd.run "gluster peer status"
    

    Note

    Run the command above from a different KVM node of the same cluster since the command outputs other peers only.

  3. Verify that the KVM node is not registered in salt-key. If the node is present, remove it:

    salt-key | grep kvm<NUM>
    salt-key -d kvm<NUM>.domain_name
    
  4. Remove the salt-key records for all VMs originally running on the failed KVM node:

    salt-key -d <kvm_node_name><NUM>.domain_name
    

    Note

    You can list all VMs running on the KVM node using the salt '*kvm<NUM>*' cmd.run 'virsh list --all' command. Alternatively, obtain the list of VMs from cluster/infra/kvm.yml.

  5. Add or reprovision a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  6. Verify that the new node has been registered on the Salt Master node successfully:

    salt-key | grep kvm
    

    Note

    If the new node is not available in the list, wait some time until the node becomes available or use the IPMI console to troubleshoot the node.

  7. Verify that the target node has connectivity with the Salt Master node:

    salt '*kvm<NUM>*' test.ping
    
  8. Verify that salt-common and salt-minion have the same version for the new node as the rest of the cluster.

    salt -t 10 'kvm*' cmd.run 'dpkg -l |grep "salt-minion\|salt-common"'
    

    Note

    If the command above shows a different version for the new node, follow the steps described in Install the correct versions of salt-common and salt-minion.

  9. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.refresh_pillar
    
  10. Apply the linux state for the added node:

    salt '*kvm<NUM>*' state.sls linux
    
  11. Perform the initial Salt configuration:

    1. Run the following commands:

      salt '*kvm<NUM>*' cmd.run "touch /run/is_rebooted"
      salt '*kvm<NUM>*' cmd.run 'reboot'
      

      Wait some time before the node is rebooted.

    2. Verify that the node is rebooted:

      salt '*kvm<NUM>*' cmd.run 'if [ -f "/run/is_rebooted" ];then echo \
      "Has not been rebooted!";else echo "Rebooted";fi'
      

      Note

      The node must be in the Rebooted state.

  12. Set up the network interfaces and the SSH access:

    salt -C 'I@salt:control' state.sls linux.system.user,openssh,linux.network,ntp
    
  13. Apply the libvirt state for the added node:

    salt '*kvm<NUM>*' state.sls libvirt
    
  14. Recreate the original VCP VMs on the new node:

    salt '*kvm<NUM>*' state.sls salt.control
    

    Note

    Salt virt takes the name of a VM and registers it on the Salt Master node.

    Once created, the instance picks up an IP address from the MAAS DHCP service and the key will be seen as accepted on the Salt Master node.

  15. Verify that the added VCP VMs are registered on the Salt Master node:

    salt-key
    
  16. Verify that the Salt Minion nodes are synchronized:

    salt '*' saltutil.sync_all
    
  17. Apply the highstate for the VCP VMs:

    salt '*kvm<NUM>*' state.highstate
    
  18. Verify that the new node has the correct IP address and proceed to restore the GlusterFS configuration as described in Recover GlusterFS on a replaced KVM node.
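
    For example, a quick check of the addresses configured on the node (a sketch):

    salt '*kvm<NUM>*' cmd.run 'ip -4 addr show'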

Recover GlusterFS on a replaced KVM node

After you replace a KVM node as described in Replace a failed KVM node, if your new KVM node has the same IP address, proceed with recovering GlusterFS as described below.

To recover GlusterFS on a replaced KVM node:

  1. Log in to the Salt Master node.

  2. Define the IP address of the failed and any working KVM node that is running the GlusterFS cluster services. For example:

    FAILED_NODE_IP=<IP_of_failed_kvm_node>
    WORKING_NODE_IP=<IP_of_working_kvm_node>
    
  3. If the failed node has been recovered with the old disk and GlusterFS installed:

    1. Remove the /var/lib/glusterd directory:

      salt -S $FAILED_NODE_IP file.remove '/var/lib/glusterd'
      
    2. Restart glusterfs-server:

      salt -S $FAILED_NODE_IP service.restart glusterfs-server
      
  4. Configure glusterfs-server on the failed node:

    salt -S $FAILED_NODE_IP state.apply glusterfs.server.service
    
  5. Remove the failed node from the GlusterFS cluster:

    salt -S $WORKING_NODE_IP cmd.run "gluster peer detach $FAILED_NODE_IP"
    
  6. Re-add the failed node to the GlusterFS cluster with a new ID:

    salt -S $WORKING_NODE_IP cmd.run "gluster peer probe $FAILED_NODE_IP"
    
  7. Finalize the configuration of the failed node:

    salt -S $FAILED_NODE_IP state.apply
    
  8. Set the correct trusted.glusterfs.volume-id attribute in the GlusterFS directories on the failed node:

    for vol in $(salt --out=txt -S $WORKING_NODE_IP cmd.run 'for dir in /srv/glusterfs/*; \
    do echo -n "${dir}@0x"; getfattr  -n trusted.glusterfs.volume-id \
    --only-values --absolute-names $dir | xxd -g0 -p;done' | awk -F: '{print $2}'); \
    do VOL_PATH=$(echo $vol| cut -d@ -f1); TRUST_ID=$(echo $vol | cut -d@ -f2); \
    salt -S $FAILED_NODE_IP cmd.run "setfattr -n trusted.glusterfs.volume-id -v $TRUST_ID $VOL_PATH"; \
    done
    
  9. Restart glusterfs-server:

    salt -S $FAILED_NODE_IP service.restart glusterfs-server
    

Move a VCP node to another host

When moving the VCP VMs that run specific services in your cloud environment, take a single VM at a time: stop it, move its disk to another host, and start the VM again on the new host machine. The services running on the VM remain available during the whole process due to the high availability ensured by Keepalived and HAProxy.

To move a VCP node to another host:

  1. To synchronize your deployment model with the new setup, update the /classes/cluster/<cluster_name>/infra/kvm.yml file:

    salt:
      control:
        cluster:
          internal:
            node:
              <nodename>:
                name: <nodename>
                provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                # replace 'infra_kvm_node03_hostname' param with the new kvm nodename provider
    
  2. Apply the salt.control state on the new KVM node:

    salt-call state.sls salt.control
    
  3. Destroy the newly spawned VM on the new KVM node:

    virsh list
    virsh destroy <nodename><nodenum>.<domainname>
    
  4. Log in to the KVM node originally hosting the VM.

  5. Stop the VM:

    virsh list
    virsh destroy <nodename><nodenum>.<domainname>
    
  6. Move the disk to the new KVM node using, for example, the scp utility, replacing the empty disk spawned by the salt.control state with the correct one:

    scp /var/lib/libvirt/images/<nodename><nodenum>.<domainname>/system.qcow2 \
    <diff_kvm_nodename>:/var/lib/libvirt/images/<nodename><nodenum>.<domainname>/system.qcow2
    
  7. Start the VM on the new KVM host:

    virsh start <nodename><nodenum>.<domainname>
    
  8. Verify that the services on the moved VM work correctly.

  9. Log in to the KVM node that was hosting the VM originally and undefine it:

    virsh list --all
    virsh undefine <nodename><nodenum>.<domainname>
    

Manage compute nodes

This section provides instructions on how to manage the compute nodes in your cloud environment.

Add a compute node

This section describes how to add a new compute node to an existing OpenStack environment.

To add a compute node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Verify that the compute node is defined in /classes/cluster/<cluster_name>/infra/config.yml.

    Note

    Create as many hosts as you have compute nodes in your environment within this file.

    Note

    Verify that the count parameter is increased by the number of compute nodes being added.

    Configuration example if the dynamic compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_rack01:
            name: ${_param:openstack_compute_rack01_hostname}<<count>>
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            repeat:
              count: 20
              start: 1
              digits: 3
              params:
                single_address:
                  value: 172.16.47.<<count>>
                  start: 101
                tenant_address:
                  value: 172.16.47.<<count>>
                  start: 101
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
    

    Configuration example if the static compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_node01:
            name: cmp01
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: 10.0.0.101
              deploy_address: 10.0.1.101
              tenant_address: 10.0.2.101
    
  3. Define the cmp<NUM> control address and hostname in the <cluster>/openstack/init.yml file:

    _param:
      openstack_compute_node<NUM>_address: <control_network_IP>
      openstack_compute_node<NUM>_hostname: cmp<NUM>
    
    linux:
      network:
        host:
          cmp<NUM>:
            address: ${_param:openstack_compute_node<NUM>_address}
            names:
            - ${_param:openstack_compute_node<NUM>_hostname}
            - ${_param:openstack_compute_node<NUM>_hostname}.${_param:cluster_domain}
    
  4. Apply the reclass.storage state on the Salt Master node to generate node definitions:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*cmp<NUM>*' test.ping
    
  6. Apply the following states:

    salt 'cfg*' state.sls salt.minion.ca
    salt '*cmp<NUM>*' state.sls salt.minion.cert
    
  7. Deploy a new compute node as described in MCP Deployment Guide: Deploy physical servers.

    Caution

    Do not use compounds for this step, since it will affect already running physical servers and reboot them. Use the Salt minion IDs instead of compounds before running the pipelines or deploying physical servers manually.

    Incorrect:

    salt -C 'I@salt:control or I@nova:compute or I@neutron:gateway' \
     cmd.run "touch /run/is_rebooted"
    salt --async -C 'I@nova:compute' cmd.run 'salt-call state.sls \
     linux.system.user,openssh,linux.network;reboot'
    

    Correct:

    salt cmp<NUM> cmd.run "touch /run/is_rebooted"
    salt --async cmp<NUM> cmd.run 'salt-call state.sls \
     linux.system.user,openssh,linux.network;reboot'
    

    Note

    We recommend that you rerun the Jenkins Deploy - OpenStack pipeline that runs on the Salt Master node with the same parameters as you have set initially during your environment deployment. This guarantees that your compute node will be properly set up and added.

Reprovision a compute node

Provisioning of compute nodes is relatively straightforward as you can run all states at once. However, you need to apply the states and reboot the node multiple times for the network configuration changes to take effect.

Note

Multiple reboots are needed because the ordering of dependencies is not yet orchestrated.

To reprovision a compute node:

  1. Verify that the name of the cmp node is not registered in salt-key on the Salt Master node:

    salt-key | grep 'cmp*'
    

    If the node is shown in the above command output, remove it:

    salt-key -d cmp<NUM>.domain_name
    
  2. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  3. Verify that the required nodes are defined in /classes/cluster/<cluster_name>/infra/config.yml.

    Note

    Create as many hosts as you have compute nodes in your environment within this file.

    Configuration example if the dynamic compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_rack01:
            name: ${_param:openstack_compute_rack01_hostname}<<count>>
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            repeat:
              count: 20
              start: 1
              digits: 3
              params:
                single_address:
                  value: 172.16.47.<<count>>
                  start: 101
                tenant_address:
                  value: 172.16.47.<<count>>
                  start: 101
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
    

    Configuration example if the static compute host generation is used:

    reclass:
      storage:
        node:
          openstack_compute_node01:
            name: cmp01
            domain: ${_param:cluster_domain}
            classes:
            - cluster.${_param:cluster_name}.openstack.compute
            params:
              salt_master_host: ${_param:reclass_config_master}
              linux_system_codename: xenial
              single_address: 10.0.0.101
              deploy_address: 10.0.1.101
              tenant_address: 10.0.2.101
    
  4. Apply the reclass.storage state on the Salt Master node to generate node definitions:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*cmp<NUM>*' test.ping
    
  6. Verify that the Salt Minion nodes are synchronized:

    salt '*cmp<NUM>*' saltutil.sync_all
    
  7. Apply the Salt highstate on the compute node(s):

    salt '*cmp<NUM>*' state.highstate
    

    Note

    Failures may occur during the first run of highstate. Rerun the state until it is successfully applied.

  8. Reboot the compute node(s) to apply network configuration changes.

  9. Reapply the Salt highstate on the node(s):

    salt '*cmp<NUM>*' state.highstate
    
  10. Provision the vRouter on the compute node using CLI or the Contrail web UI. Example of the CLI command:

    salt '*cmp<NUM>*' cmd.run '/usr/share/contrail-utils/provision_vrouter.py \
        --host_name <CMP_HOSTNAME> --host_ip <CMP_IP_ADDRESS> --api_server_ip <CONTRAIL_VIP> \
        --oper add --admin_user admin --admin_password <PASSWORD> \
        --admin_tenant_name admin --openstack_ip <OPENSTACK_VIP>'
    

    Note

    • To obtain <CONTRAIL_VIP>, run salt-call pillar.get _param:keepalived_vip_address on any ntw node.

    • To obtain <OPENSTACK_VIP>, run salt-call pillar.get _param:keepalived_vip_address on any ctl node.

Remove a compute node

This section instructs you on how to safely remove a compute node from your OpenStack environment.

To remove a compute node:

  1. Stop and disable the salt-minion service on the compute node you want to remove:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  2. Verify that the name of the node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep cmp<NUM>
    salt-key -d cmp<NUM>.domain_name
    
  3. Log in to an OpenStack controller node.

  4. Source the OpenStack RC file to set the required environment variables for the OpenStack command-line clients:

    source keystonercv3
    
  5. Disable the nova-compute service on the target compute node:

    openstack compute service set --disable <cmp_host_name> nova-compute
    
  6. Verify that Nova does not schedule new instances on the target compute node by viewing the output of the following command:

    openstack compute service list
    

    The command output should display the disabled status for the nova-compute service running on the target compute node.

  7. Migrate your instances using the openstack server migrate command. You can perform live or cold migration.
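
    For example, a minimal sketch using placeholder values; nova host-evacuate-live is the same command used in the Reboot a compute node section of this guide:

    # List the instances still running on the compute node to be removed
    # (requires admin credentials):
    openstack server list --all-projects --host <cmp_host_name>

    # Live-migrate all instances off the node:
    nova host-evacuate-live <cmp_host_name>

    # Alternatively, cold-migrate a single instance:
    openstack server migrate <instance_id>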

  8. Log in to the target compute node.

  9. Stop the nova-compute service:

    systemctl disable nova-compute
    systemctl stop nova-compute
    
  10. Log in to the OpenStack controller node.

  11. Obtain the ID of the compute service to delete:

    openstack compute service list
    
  12. Delete the compute service substituting service_id with the value obtained in the previous step:

    openstack compute service delete <service_id>
    
  13. Select from the following options:

    • For the deployments with OpenContrail:

      1. Log in to the target compute node.

      2. Stop the supervisor-vrouter service:

        systemctl disable supervisor-vrouter
        systemctl stop supervisor-vrouter
        
      3. Log in to the OpenContrail UI.

      4. Navigate to Configure > Infrastructure > Virtual Routers.

      5. Select the target compute node.

      6. Click Delete.

    • For the deployments with OVS:

      1. Stop the neutron-openvswitch-agent service:

        systemctl disable neutron-openvswitch-agent.service
        systemctl stop neutron-openvswitch-agent.service
        
      2. Obtain the ID of the target compute node agent:

        openstack network agent list
        
      3. Delete the network agent substituting cmp_agent_id with the value obtained in the previous step:

        openstack network agent delete <cmp_agent_id>
        
  14. If you plan to replace the removed compute node with a new compute node with the same hostname, you need to manually clean up the resource provider record from the placement service using the curl tool:

    1. Log in to an OpenStack controller node.

    2. Obtain the token ID from the openstack token issue command output. For example:

      openstack token issue
      +------------+-------------------------------------+
      | Field      | Value                               |
      +------------+-------------------------------------+
      | expires    | 2018-06-22T10:30:17+0000            |
      | id         | gAAAAABbLMGpVq2Gjwtc5Qqmp...        |
      | project_id | 6395787cdff649cdbb67da7e692cc592    |
      | user_id    | 2288ac845d5a4e478ffdc7153e389310    |
      +------------+-------------------------------------+
      
    3. Obtain the resource provider UUID of the target compute node:

      curl -i -X GET <placement-endpoint-address>/resource_providers?name=<target-compute-host-name> -H \
      'content-type: application/json' -H 'X-Auth-Token: <token>'
      

      Substitute the following parameters as required:

      • placement-endpoint-address

        The placement endpoint can be obtained from the openstack catalog list command output. A placement endpoint includes the scheme, endpoint address, and port, for example, http://10.11.0.10:8778. Depending on the deployment, you may need to specify the https scheme rather than http.

      • target-compute-host-name

        The hostname of the compute node you are removing. For the correct hostname format to pass, see the Hypervisor Hostname column in the openstack hypervisor list command output.

      • token

        The token id value obtained in the previous step.

      Example of system response:

      {
        "resource_providers": [
          {
            "generation": 1,
            "uuid": "08090377-965f-4ad8-9a1b-87f8e8153896",
            "links": [
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896",
                "rel": "self"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/aggregates",
                "rel": "aggregates"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/inventories",
                "rel": "inventories"
              },
              {
                "href": "/resource_providers/08090377-965f-4ad8-9a1b-87f8e8153896/usages",
                "rel": "usages"
              }
            ],
            "name": "<compute-host-name>"
          }
        ]
      }
      
    4. Delete the resource provider record from the placement service substituting placement-endpoint-address, target-compute-node-uuid, and token with the values obtained in the previous steps:

      curl -i -X DELETE <placement-endpoint-address>/resource_providers/<target-compute-node-uuid> -H \
      'content-type: application/json' -H 'X-Auth-Token: <token>'
      
  15. Log in to the Salt Master node.

  16. Remove the compute node definition from the model in infra/config.yml under the reclass:storage:node pillar.

  17. Remove the generated file for the removed compute node under /srv/salt/reclass/nodes/_generated.

  18. Remove the compute node from StackLight LMA:

    1. Update and clear the Salt mine:

      salt -C 'I@salt:minion' state.sls salt.minion.grains
      salt -C 'I@salt:minion' saltutil.refresh_modules
      salt -C 'I@salt:minion' mine.update clear=true
      
    2. Refresh the targets and alerts:

      salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus -b 1
      

Reboot a compute node

This section instructs you on how to reboot an OpenStack compute node for a planned maintenance.

To reboot an OpenStack compute node:

  1. Log in to an OpenStack controller node.

  2. Disable scheduling of new VMs to the node. Optionally provide a reason comment:

    openstack compute service set --disable --disable-reason \
    maintenance <compute_node_hostname> nova-compute
    
  3. Migrate workloads from the OpenStack compute node:

    nova host-evacuate-live <compute_node_hostname>
    
  4. Log in to an OpenStack compute node.

  5. Stop the nova-compute service:

    service nova-compute stop
    
  6. Shut down the OpenStack compute node, perform the maintenance, and turn the node back on.

  7. Verify that the nova-compute service is up and running:

    service nova-compute status
    
  8. Perform the following steps from the OpenStack controller node:

    1. Enable scheduling of VMs to the node:

      openstack compute service set --enable <compute_node_hostname> nova-compute
      
    2. Verify that the nova-compute service and neutron agents are running on the node:

      openstack network agent list --host <compute_node_hostname>
      openstack compute service list --host <compute_node_hostname>
      

      The OpenStack compute service state must be up. The Neutron agent service state must be UP and the alive column must include :-).

      Examples of a positive system response:

      +----+--------------+------+------+---------+-------+----------------------------+
      | ID | Binary       | Host | Zone | Status  | State | Updated At                 |
      +----+--------------+------+------+---------+-------+----------------------------+
      | 70 | nova-compute | cmp1 | nova | enabled | up    | 2020-09-17T08:51:07.000000 |
      +----+--------------+------+------+---------+-------+----------------------------+
      
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      | ID       | Agent Type         | Host | Availability Zone | Alive | State | Binary                    |
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      | e4256d73 | Open vSwitch agent | cmp1 | None              | :-)   | UP    | neutron-openvswitch-agent |
      +----------+--------------------+------+-------------------+-------+-------+---------------------------+
      
    3. Optional. Migrate the instances back to their original OpenStack compute node.
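
      For example, a minimal sketch using the nova client with placeholder values:

      nova live-migration <instance_id> <compute_node_hostname>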

Manage gateway nodes

This section describes how to manage tenant network gateway nodes that provide access to an external network for the environments configured with Neutron OVS as a networking solution.

Add a gateway node

The gateway nodes are hardware nodes that provide gateways and routers to the OVS-based tenant networks using network virtualization functions. A standard cloud configuration includes three gateway nodes. However, you can scale the networking throughput by adding more gateway servers.

This section explains how to increase the number of the gateway nodes in your cloud environment.

To add a gateway node:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Define the gateway node in /classes/cluster/<cluster_name>/infra/config.yml. For example:

    parameters:
      _param:
        openstack_gateway_node03_hostname: gtw03
        openstack_gateway_node03_tenant_address: <IP_of_gtw_node_tenant_address>
      reclass:
        storage:
          node:
            openstack_gateway_node03:
              name: ${_param:openstack_gateway_node03_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.openstack.gateway
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename: ${_param:linux_system_codename}
                single_address: ${_param:openstack_gateway_node03_address}
                tenant_address: ${_param:openstack_gateway_node03_tenant_address}
    
  3. On the Salt Master node, generate node definitions by applying the reclass.storage state:

    salt '*cfg*' state.sls reclass.storage
    
  4. Verify that the target nodes have connectivity with the Salt Master node:

    salt '*gtw<NUM>*' test.ping
    
  5. Verify that the Salt Minion nodes are synchronized:

    salt '*gtw<NUM>*' saltutil.sync_all
    
  6. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  7. Perform the initial Salt configuration:

    salt '*gtw<NUM>*' state.sls salt.minion
    
  8. Set up the network interfaces and the SSH access:

    salt '*gtw<NUM>*' state.sls linux.system.user,openssh,linux.network,ntp,neutron
    
  9. Apply the highstate on the gateway node:

    salt '*gtw<NUM>*' state.highstate
    

Reprovision a gateway node

If a tenant network gateway node is down, you may need to reprovision it.

To reprovision a gateway node:

  1. Verify that the name of the gateway node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep gtw<NUM>
    salt-key -d gtw<NUM>.domain_name
    
  2. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  3. Verify that the required gateway node is defined in /classes/cluster/<cluster_name>/infra/config.yml.

  4. Generate the node definition, by applying the reclass.storage state on the Salt Master node:

    salt '*cfg*' state.sls reclass.storage
    
  5. Verify that the target node has connectivity with the Salt Master node:

    salt '*gtw<NUM>*' test.ping
    
  6. Verify that the Salt Minion nodes are synchronized:

    salt '*gtw<NUM>*' saltutil.sync_all
    
  7. On the added node, verify that salt-common and salt-minion have the 2017.7 version.

    apt-cache policy salt-common
    apt-cache policy salt-minion
    

    Note

    If the commands above show a different version, follow the MCP Deployment guide: Install the correct versions of salt-common and salt-minion.

  8. Perform the initial Salt configuration:

    salt '*gtw<NUM>*' state.sls salt.minion
    
  9. Set up the network interfaces and the SSH access:

    salt '*gtw<NUM>*' state.sls linux.system.user,openssh,linux.network,ntp,neutron
    
  10. Apply the Salt highstate on the gateway node:

    salt '*gtw<NUM>*' state.highstate
    

Manage RabbitMQ nodes

A RabbitMQ cluster is sensitive to external factors like network throughput and traffic spikes. When running under high load, it requires special start, stop, and restart procedures.

Restart a RabbitMQ node

Caution

We recommend that you do not restart a RabbitMQ node in a production environment by executing systemctl restart rabbitmq-server since the cluster can become inoperative.

To restart a single RabbitMQ node:

  1. Gracefully stop rabbitmq-server on the target node:

    systemctl stop rabbitmq-server
    
  2. Verify that the node is removed from the cluster and RabbitMQ is stopped on this node:

    rabbitmqctl cluster_status
    

    Example of system response:

    Cluster status of node rabbit@msg01
    [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
    {running_nodes,[rabbit@msg03,rabbit@msg01]},  # <<< rabbit stopped on msg02
    {cluster_name,<<"openstack">>},
    {partitions,[]},
    {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]}]}]
    
  3. Start rabbitmq-server:

    systemctl start rabbitmq-server
    

Restart a RabbitMQ cluster

To restart the whole RabbitMQ cluster:

  1. Stop RabbitMQ on nodes one by one:

    salt msg01* cmd.run 'systemctl stop rabbitmq-server'
    salt msg02* cmd.run 'systemctl stop rabbitmq-server'
    salt msg03* cmd.run 'systemctl stop rabbitmq-server'
    
  2. Restart RabbitMQ in the reverse order:

    salt msg03* cmd.run 'systemctl start rabbitmq-server'
    salt msg02* cmd.run 'systemctl start rabbitmq-server'
    salt msg01* cmd.run 'systemctl start rabbitmq-server'
    

Restart RabbitMQ with clearing the Mnesia database

To restart RabbitMQ with clearing the Mnesia database:

  1. Stop RabbitMQ on nodes one by one:

    salt msg01* cmd.run 'systemctl stop rabbitmq-server'
    salt msg02* cmd.run 'systemctl stop rabbitmq-server'
    salt msg03* cmd.run 'systemctl stop rabbitmq-server'
    
  2. Remove the Mnesia database on all nodes:

    salt msg0* cmd.run 'rm -rf /var/lib/rabbitmq/mnesia/'
    
  3. Apply the rabbitmq state on the first RabbitMQ node:

    salt msg01* state.apply rabbitmq
    
  4. Apply the rabbitmq state on the remaining RabbitMQ nodes:

    salt -C "msg02* or msg03*" state.apply rabbitmq
    

Remove a node

Removal of a node from a Salt-managed environment is a matter of disabling the salt-minion service running on the node, removing its key from the Salt Master node, and updating the services so that they know that the node is not available anymore.

To remove a node:

  1. Stop and disable the salt-minion service on the node you want to remove:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  2. Verify that the name of the node is not registered in salt-key on the Salt Master node. If the node is present, remove it:

    salt-key | grep <nodename><NUM>
    salt-key -d <nodename><NUM>.domain_name
    
  3. Update your Reclass metadata model to remove the node from services. Apply the necessary Salt states. This step is generic as different services can be involved depending on the node being removed.
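
    For example, if the removed node was a back end in HAProxy, update the corresponding cluster model files and reapply the haproxy state on the proxy nodes. This is only a sketch; the actual states to apply depend on the role of the removed node:

    salt -C 'I@haproxy:proxy' state.sls haproxy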

Manage certificates

After you deploy an MCP cluster, you can renew your expired certificates or replace them with the endpoint certificates provided by a customer as required. When you renew a certificate, its key remains the same. When you replace a certificate, a new certificate key is added accordingly.

You can either push certificates from pillars or regenerate them as follows:

  • Generate and update by salt-minion (signed by salt-master)

  • Generate and update by external certificate authorities, for example, by Let’s Encrypt

Certificates generated by salt-minion can be renewed by the salt-minion state. The renewal operation becomes available within 30 days before the expiration date. This is controlled by the days_remaining parameter of the x509.certificate_managed Salt state. Refer to Salt.states.x509 for details.

You can force the renewal of certificates by removing the old certificates and running the salt.minion.cert state on each target node.
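
For example, the following is a minimal sketch for forcing the renewal of the proxy certificates managed by salt-minion; the path and target mirror the NGINX certificates procedure described later in this section:

    rm -f /srv/salt/pki/*/proxy.crt
    salt -C 'I@nginx:server:site:*:host:protocol:https' state.sls salt.minion.cert -b 1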

Publish CA certificates

If you use certificates issued by Certificate Authorities that are not recognized by an operating system, you must publish them.

To publish CA certificates:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Create the /infra/ssl/init.yml file with the following configuration as an example:

    parameters:
      linux:
        system:
          ca_certificates:
            ca-salt_master_ca: |
              -----BEGIN CERTIFICATE-----
              MIIGXzCCBEegAwIBAgIDEUB0MA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
              ...
              YqQO
              -----END CERTIFICATE-----
            ca-salt_master_ca_old: |
              -----BEGIN CERTIFICATE-----
              MIIFgDCCA2igAwIBAgIDET0sMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
              ...
              WzUuf8H9dBW2DPtk5Jq/+QWtYMs=
              -----END CERTIFICATE-----
    
  3. To publish the certificates on all nodes managed by Salt, update /infra/init.yml by adding the newly created class:

    classes:
    - cluster.<cluster_name>.infra.ssl
    
  4. To publish the certificates on a specific node, update /infra/config.yml. For example:

    parameters:
      reclass:
        storage:
          node:
            openstack_control_node01:
              classes:
              - cluster.${_param:cluster_name}.openstack.ssl
    
  5. Log in to the Salt Master node.

  6. Update the Reclass storage:

    salt-call state.sls reclass.storage -l debug
    
  7. Apply the linux.system.certificate state on all nodes:

    salt \* state.sls linux.system.certificate -l debug
    

NGINX certificates

This section describes how to renew or replace the NGINX certificates that are either managed by salt-minion or self-managed through pillars. In both cases, you must verify the GlusterFS share salt_pki before the renewal or replacement.

Verify the GlusterFS share salt_pki

Before you proceed with the NGINX certificates renewal or replacement, verify the GlusterFS share salt_pki.

To verify the GlusterFS share salt_pki:

  1. Log in to any infrastructure node that hosts the salt_pki GlusterFS volume.

  2. Obtain the list of the GlusterFS minion IDs:

    salt -C 'I@glusterfs:server' test.ping --output yaml | cut -d':' -f1
    

    Example of system response:

    kvm01.multinode-ha.int
    kvm03.multinode-ha.int
    kvm02.multinode-ha.int
    
  3. Verify that the volume is replicated and is online for any of the minion IDs from the list obtained in the previous step.

    salt <minion_id> cmd.run 'gluster volume status salt_pki'
    

    Example of system response:

    Status of volume: salt_pki
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 192.168.2.241:/srv/glusterfs/salt_pki 49154     0          Y       9211
    Brick 192.168.2.242:/srv/glusterfs/salt_pki 49154     0          Y       8499
    Brick 192.168.2.243:/srv/glusterfs/salt_pki 49154     0          Y       8332
    Self-heal Daemon on localhost               N/A       N/A        Y       6313
    Self-heal Daemon on 192.168.2.242           N/A       N/A        Y       10203
    Self-heal Daemon on 192.168.2.243           N/A       N/A        Y       2068
    
    Task Status of Volume salt_pki
    ------------------------------------------------------------------------------
    There are no active volume tasks
    
  4. Log in to the Salt Master node.

  5. Verify that the salt_pki volume is mounted on each proxy node and the Salt Master node:

    salt -C 'I@nginx:server:site:*:host:protocol:https or I@salt:master' \
    cmd.run 'mount | grep salt_pki'
    

    Example of system response:

    prx01.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    prx02.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    cfg01.multinode-ha.int:
        192.168.2.240:/salt_pki on /srv/salt/pki type fuse.glusterfs \
        (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
    
  6. Proceed with the renewal or replacement of the NGINX certificates as required.

Renew or replace the NGINX certificates managed by salt-minion

This section describes how to renew or replace the NGINX certificates managed by salt-minion.

To renew or replace the NGINX certificates managed by salt-minion:

  1. Complete the steps described in Verify the GlusterFS share salt_pki.

  2. Log in to the Salt Master node.

  3. Verify the certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  4. Remove your current certificates from the Salt Master node.

    Note

    The following command also removes certificates from all proxy nodes as they use the same GlusterFS share.

    rm -f /srv/salt/pki/*/*.[pemcrt]*
    
  5. If you replace the certificates, remove the private key:

    rm -f /srv/salt/pki/*/proxy.key
    
  6. Renew or replace your certificates by applying the salt.minion state on all proxy nodes one by one:

    salt -C 'I@nginx:server:site:*:host:protocol:https' state.sls salt.minion.cert -b 1
    
  7. Apply the nginx state on all proxy nodes one by one:

    salt -C 'I@nginx:server:site:*:host:protocol:https' state.sls nginx -b 1
    
  8. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
Renew the self-managed NGINX certificates

This section describes how to renew the self-managed NGINX certificates.

To renew the self-managed NGINX certificates:

  1. Complete the steps described in Verify the GlusterFS share salt_pki.

  2. Open your project Git repository with the Reclass model on the cluster level.

  3. Update the /openstack/proxy.yml file with the following configuration as an example:

    parameters:
      _params:
        nginx_proxy_ssl:
          enabled: true
          mode: secure
          key_file:  /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT.key
          cert_file: /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT.crt
          chain_file: /srv/salt/pki/${_param:cluster_name}/FQDN_PROXY_CERT_CHAIN.crt
          key: |
            -----BEGIN PRIVATE KEY-----
            MIIJRAIBADANBgkqhkiG9w0BAQEFAASCCS4wggkqAgEAAoICAQC3qXiZiugf6HlR
            ...
            aXK0Fg1hJKu60Oh+E5H1d+ZVbP30xpdQ
            -----END PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIHDzCCBPegAwIBAgIDLYclMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            lHfjP1c6iWAL0YEp1IMCeM01l4WWj0ymb7f4wgOzcULfwzU=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN CERTIFICATE-----
            MIIFgDCCA2igAwIBAgIDET0sMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            UPwFzYIVkwy4ny+UJm9js8iynKro643mXty9vj5TdN1iK3ZA4f4/7kenuHtGBNur
            WzUuf8H9dBW2DPtk5Jq/+QWtYMs=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIGXzCCBEegAwIBAgIDEUB0MA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            /inxvBr89TvbCP2hweGMD6w1mKJU2SWEQwMs7P72dU7VuVqyyoutMWakJZ+xoGE9
            YqQO
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIHDzCCBPegAwIBAgIDLYclMA0GCSqGSIb3DQEBCwUAMFkxEzARBgoJkiaJk/Is
            ...
            lHfjP1c6iWAL0YEp1IMCeM01l4WWj0ymb7f4wgOzcULfwzU=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.

    • If you replace the certificates, modify all three sections.

    Note

    The key, cert, and chain sections are optional. You can select from the following options:

    • Store certificates in the file system in /srv/salt/pki/**/ and add the key_file, cert_file, and chain_file lines to /openstack/proxy.yml.

    • Add only the key, cert, and chain sections without the key_file, cert_file, and chain_file lines to /openstack/proxy.yml. The certificates are stored under the /etc directory as default paths in the Salt formula.

    • Use all three sections, as in the example above. All content is available in pillar and is stored in /srv/salt/pki/** as well. This option requires manual upload of the certificates and key files content to the .yml files.

  4. Log in to the Salt Master node.

  5. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  6. Remove the current certificates.

    Note

    The following command also removes certificates from all proxy nodes as they use the same GlusterFS share.

    rm -f /srv/salt/pki/*/*.[pemcrt]*
    
  7. If you replace the certificates, remove the private key:

    rm -f /srv/salt/pki/*/proxy.key
    
  8. Apply the nginx state on all proxy nodes one by one:

    salt -C 'I@nginx:server' state.sls nginx -b 1
    
  9. Verify the new certificate validity date:

    openssl x509 -in /srv/salt/pki/*/proxy.crt -text -noout | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    
  10. Restart the NGINX services and remove the VIP before restart:

    salt -C 'I@nginx:server' cmd.run 'service keepalived stop; sleep 5; \
    service nginx restart; service keepalived start' -b 1
    

HAProxy certificates

This section describes how to renew or replace the HAProxy certificates that are either managed by salt-minion or self-managed through pillars.

Renew or replace the HAProxy certificates managed by salt-minion

This section describes how to renew or replace the HAProxy certificates managed by salt-minion.

To renew or replace the HAProxy certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Obtain the list of the HAProxy minion IDs where the certificate should be replaced:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1
    

    Example of system response:

    cid02.multinode-ha.int
    cid03.multinode-ha.int
    cid01.multinode-ha.int
    
  3. Verify the certificate validity date for each HAProxy minion listed in the output of the above command:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | \
    cut -d"'" -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} \
    cmd.run "openssl x509 -in ${c} -text | egrep -i 'after|before'"; done; done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    
  4. Remove your current certificates from each HAProxy minion:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | sed s/-all.pem/.crt/ | tr '\n' ' '); \
    do salt -C ${m} cmd.run "rm -f ${c}"; done; done; \
    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} cmd.run "rm -f ${c}"; done; done; \
    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    cmd.run 'rm -f /etc/haproxy/ssl/salt_master_ca-ca.crt'
    
  5. If you replace the certificates, remove the private key:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | sed s/-all.pem/.key/ | tr '\n' ' '); \
    do salt -C ${m} cmd.run "rm -f ${c}"; done; done;
    
  6. Apply the salt.minion.grains state for all HAProxy nodes to retrieve the CA certificate from Salt Master:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' state.sls salt.minion.grains
    
  7. Apply the salt.minion.cert state for all HAProxy nodes:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' state.sls salt.minion.cert
    
  8. Verify the new certificate validity date:

    for m in $(salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    pillar.get _nonexistent | cut -d':' -f1); do for c in $(salt -C ${m} \
    pillar.get 'haproxy:proxy:listen' --out=txt | egrep -o "'pem_file': '\S+'" | cut -d"'" \
    -f4 | sort | uniq | tr '\n' ' '); do salt -C ${m} \
    cmd.run "openssl x509 -in ${c} -text | egrep -i 'after|before'"; done; done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    
  9. Restart the HAProxy services on each HAProxy minion one by one. The following command stops keepalived to remove the VIP before each restart and starts it again afterward:

    salt -C 'I@haproxy:proxy:listen:*:binds:ssl:enabled:true' \
    cmd.run 'service keepalived stop; sleep 5; \
    service haproxy stop; service haproxy start; service keepalived start' -b 1
    
Renew or replace the self-managed HAProxy certificates

This section describes how to renew or replace the self-managed HAProxy certificates.

To renew or replace the self-managed HAProxy certificates:

  1. Log in to the Salt Master node.

  2. Verify the certificate validity date:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '.. \
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in ${name} -text -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Note

    In the command above, the pem_file value is used to specify the explicit certificate path.

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 25 15:32:17 2018 GMT
                    Not After : May 25 15:32:17 2019 GMT
    cid01.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid03.multinode-ha.int:
                    Not Before: May 25 15:21:17 2018 GMT
                    Not After : May 25 15:21:17 2019 GMT
    
  3. Open your project Git repository with the Reclass model on the cluster level.

  4. For each class file with the HAProxy class enabled, update its pillar values with the following configuration as an example:

    parameters:
      _param:
        haproxy_proxy_ssl:
          enabled: true
          mode: secure
          key: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
            ...
            M8IfJ5I=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.

    • If you replace the certificates, modify all three sections.

  5. Remove your current certificates from the HAProxy nodes:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '.. \
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "rm -f ${name}";
      done;
    done;
    
  6. Apply the haproxy.proxy state on all HAProxy nodes one by one:

    salt -C 'I@haproxy:proxy' state.sls haproxy.proxy -b 1
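
    Optionally, verify the HAProxy configuration syntax on the proxy nodes, for example:

    salt -C 'I@haproxy:proxy' cmd.run 'haproxy -c -f /etc/haproxy/haproxy.cfg'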
    
  7. Verify the new certificate validity date:

    for node in $(salt -C 'I@haproxy:proxy' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get haproxy:proxy --output=json | jq '.. \
      | .listen? | .. | .ssl? | .pem_file?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in ${name} -text -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    cid02.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid03.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    cid01.multinode-ha.int:
                    Not Before: May 25 15:29:17 2018 GMT
                    Not After : May 25 15:29:17 2019 GMT
    
  8. Restart the HAProxy services one by one. The following command stops keepalived to remove the VIP before each restart and starts it again afterward:

    salt -C 'I@haproxy:proxy' cmd.run 'service keepalived stop; sleep 5; \
    service haproxy stop; service haproxy start; service keepalived start' -b 1
    

Apache certificates

This section describes how to renew or replace the Apache certificates, which can be either managed by salt-minion or self-managed through pillars.

Renew or replace the Apache certificates managed by salt-minion

This section describes how to renew or replace the Apache certificates managed by salt-minion.

Warning

If you replace or renew the Apache certificates after the Salt Master CA certificate has been replaced, make sure that both new and old CA certificates are published as described in Publish CA certificates.

To renew or replace the Apache certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Verify your current certificate validity date:

    salt -C 'I@apache:server' cmd.run 'openssl x509 \
    -in /etc/ssl/certs/internal_proxy.crt -text -noout | grep -Ei "after|before"'
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: May 29 12:58:25 2018 GMT
                    Not After : May 29 12:58:25 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Apr 27 12:37:28 2018 GMT
                    Not After : Apr 27 12:37:28 2019 GMT
    
  3. Remove your current certificates from the Apache nodes:

    salt -C 'I@apache:server' cmd.run 'rm -f /etc/ssl/certs/internal_proxy.crt'
    
  4. If you replace the certificates, remove the private key:

    salt -C 'I@apache:server' cmd.run 'rm -f /etc/ssl/private/internal_proxy.key'
    
  5. Renew or replace your certificates by applying the salt.minion.cert state on all Apache nodes one by one:

    salt -C 'I@apache:server' state.sls salt.minion.cert
    
  6. Refresh the CA chain:

    salt -C 'I@apache:server' cmd.run 'cat /etc/ssl/certs/internal_proxy.crt \
    /usr/local/share/ca-certificates/ca-salt_master_ca.crt > \
    /etc/ssl/certs/internal_proxy-with-chain.crt; \
    chmod 0644 /etc/ssl/certs/internal_proxy-with-chain.crt; \
    chown root:root /etc/ssl/certs/internal_proxy-with-chain.crt'
    
  7. Verify the new certificate validity date:

    salt -C 'I@apache:server' cmd.run 'openssl x509 \
    -in /etc/ssl/certs/internal_proxy.crt -text -noout | grep -Ei "after|before"'
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: Jun  6 17:24:42 2018 GMT
                    Not After : Jun  6 17:24:42 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Jun  6 17:23:38 2018 GMT
                    Not After : Jun  6 17:23:38 2019 GMT
    
  8. Restart the Apache services one by one:

    salt -C 'I@apache:server' cmd.run 'service apache2 stop; service apache2 start; sleep 60' -b1
    
Replace the self-managed Apache certificates

This section describes how to replace the self-managed Apache certificates.

Warning

If you replace or renew the Apache certificates after the Salt Master CA certificate has been replaced, make sure that both new and old CA certificates are published as described in Publish CA certificates.

To replace the self-managed Apache certificates:

  1. Log in to the Salt Master node.

  2. Verify your current certificate validity date:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in /etc/ssl/certs/${name}.crt -text \
        -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: May 29 12:58:21 2018 GMT
                    Not After : May 29 12:58:21 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: May 29 12:58:25 2018 GMT
                    Not After : May 29 12:58:25 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Apr 27 12:37:28 2018 GMT
                    Not After : Apr 27 12:37:28 2019 GMT
    
  3. Open your project Git repository with the Reclass model on the cluster level.

  4. For each class file with the Apache server class enabled, update the _param:apache_proxy_ssl value with the following configuration as an example:

    parameters:
      _param:
        apache_proxy_ssl:
          enabled: true
          mode: secure
          key: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
          cert: |
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
          chain: |
            -----BEGIN RSA PRIVATE KEY-----
            MIIJKAIBAAKCAgEAxSXLtYhzptxcAdnsNy2r8NkgskPm3J/l54hmhuSoL61LpEIi
            ...
            0z/c5yAddRpU/i6/TH2RlBaSGfmoNw/IuFfLsZI2O6dQo4e+QKX+V3JTeNY=
            -----END RSA PRIVATE KEY-----
            -----BEGIN CERTIFICATE-----
            MIIGEzCCA/ugAwIBAgIILX5kuGcAhw8wDQYJKoZIhvcNAQELBQAwSjELMAkGA1UE
            ...
            /in+Y5Wrl1uGHYeFe0yOdb1uxH+PLxc=
            -----END CERTIFICATE-----
            -----BEGIN CERTIFICATE-----
            MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
            ...
            M8IfJ5I=
            -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.

    • If you replace the certificates, modify all three sections.

  5. Remove your current certificates from the Apache nodes:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "rm -f /etc/ssl/certs/${name}.crt";
      done;
    done;
    
  6. Apply the apache.server state on all Apache nodes one by one:

    salt -C 'I@apache:server' state.sls apache.server
    
  7. Verify the new certificate validity date:

    for node in $(salt -C 'I@apache:server' test.ping --output yaml | cut -d':' -f1); do
      for name in $(salt ${node} pillar.get apache:server:site --output=json | \
      jq '.. | .host? | .name?' | grep -v null | sort | uniq); do
        salt ${node} cmd.run "openssl x509 -in /etc/ssl/certs/${name}.crt -text \
        -noout | grep -Ei 'after|before'";
      done;
    done;
    

    Example of system response:

    ctl02.multinode-ha.int:
                    Not Before: Jun  6 17:24:09 2018 GMT
                    Not After : Jun  6 17:24:09 2019 GMT
    ctl03.multinode-ha.int:
                    Not Before: Jun  6 17:24:42 2018 GMT
                    Not After : Jun  6 17:24:42 2019 GMT
    ctl01.multinode-ha.int:
                    Not Before: Jun  6 17:23:38 2018 GMT
                    Not After : Jun  6 17:23:38 2019 GMT
    
  8. Restart the Apache services one by one:

    salt -C 'I@apache:server' cmd.run 'service apache2 stop; service apache2 start' -b 1
    

RabbitMQ certificates

This section describes how to renew or replace the RabbitMQ cluster certificates, which can be either managed by salt-minion or self-managed through pillars.

Verify that the RabbitMQ cluster uses certificates

This section describes how to determine whether your RabbitMQ cluster uses certificates and identify their location on the system.

To verify that the RabbitMQ cluster uses certificates:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@rabbitmq:server' cmd.run "rabbitmqctl environment | \
    grep -E '/ssl/|ssl_listener|protocol_version'"
    

    Example of system response:

    msg02.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
    msg01.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
    msg03.multinode-ha.int:
              {ssl_listeners,[{"0.0.0.0",5671}]},
                  [{cacertfile,"/etc/rabbitmq/ssl/ca.pem"},
                   {certfile,"/etc/rabbitmq/ssl/cert.pem"},
                   {keyfile,"/etc/rabbitmq/ssl/key.pem"},
         {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
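
    Optionally, you can also check the certificate that is actually served on the SSL listener, for example (assuming the default 5671 port shown above):

    salt -C 'I@rabbitmq:server' cmd.run 'echo | openssl s_client -connect 127.0.0.1:5671 2>/dev/null | openssl x509 -noout -enddate'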
    
  3. Proceed to renewal or replacement of your certificates as required.

Renew or replace the RabbitMQ certificates managed by salt-minion

This section describes how to renew or replace the RabbitMQ certificates managed by salt-minion.

To renew or replace the RabbitMQ certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Verify the certificate validity dates:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Apr 27 12:37:14 2018 GMT
    Not After : Apr 27 12:37:14 2019 GMT
    Not Before: Apr 27 12:37:08 2018 GMT
    Not After : Apr 27 12:37:08 2019 GMT
    Not Before: Apr 27 12:37:13 2018 GMT
    Not After : Apr 27 12:37:13 2019 GMT
    
  3. Remove the certificates from the RabbitMQ nodes:

    salt -C 'I@rabbitmq:server' cmd.run 'rm -f /etc/rabbitmq/ssl/cert.pem'
    
  4. If you replace the certificates, remove the private key:

    salt -C 'I@rabbitmq:server' cmd.run 'rm -f /etc/rabbitmq/ssl/key.pem'
    
  5. Regenerate the certificates on the RabbitMQ nodes:

    salt -C 'I@rabbitmq:server' state.sls salt.minion.cert
    
  6. Verify that the certificate validity dates have changed:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 23:52:40 2018 GMT
    Not After : Jun  4 23:52:40 2019 GMT
    Not Before: Jun  4 23:52:41 2018 GMT
    Not After : Jun  4 23:52:41 2019 GMT
    Not Before: Jun  4 23:52:41 2018 GMT
    Not After : Jun  4 23:52:41 2019 GMT
    
  7. Restart the RabbitMQ services one by one:

    salt -C 'I@rabbitmq:server' cmd.run 'service rabbitmq-server stop; \
    service rabbitmq-server start' -b1
    
  8. Verify the RabbitMQ cluster status:

    salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
    

    Example of system response:

    msg03.multinode-ha.int:
        Cluster status of node rabbit@msg03
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg01,rabbit@msg02,rabbit@msg03]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg01,[]},{rabbit@msg02,[]},{rabbit@msg03,[]}]}]
    msg01.multinode-ha.int:
        Cluster status of node rabbit@msg01
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg02,rabbit@msg01]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg02,[]},{rabbit@msg01,[]}]}]
    msg02.multinode-ha.int:
        Cluster status of node rabbit@msg02
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg01,rabbit@msg02]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]},{rabbit@msg02,[]}]}]
    
Renew or replace the self-managed RabbitMQ certificates

This section describes how to renew or replace the self-managed RabbitMQ certificates.

To renew or replace the self-managed RabbitMQ certificates:

  1. Open your project Git repository with Reclass model on the cluster level.

  2. Create the /openstack/ssl/rabbitmq.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      rabbitmq:
         server:
           enabled: true
           ...
           ssl:
             enabled: True
             key: ${_param:rabbitmq_ssl_key}
             cacert_chain: ${_param:rabbitmq_ssl_cacert_chain}
             cert: ${_param:rabbitmq_ssl_cert}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  3. Create the /openstack/ssl/init.yml file with the following configuration as an example:

    parameters:
      _param:
        rabbitmq_ssl_cacert_chain: |
          -----BEGIN CERTIFICATE-----
          MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          RHXc4FoWv9/n8ZcfsqjQCjF3vUUZBB3zdlfLCLJRruB4xxYukc3gFpFLm21+0ih+
          M8IfJ5I=
          -----END CERTIFICATE-----
        rabbitmq_ssl_key: |
          -----BEGIN RSA PRIVATE KEY-----
          MIIJKQIBAAKCAgEArVSJ16ePjCik+6bZBzhiu3enXw8R9Ms1k4x57633IX1sEZTJ
          ...
          0VgM2bDSNyUuiwCbOMK0Kyn+wGeHF/jGSbVsxYI4OeLFz8gdVUqm7olJj4j3xemY
          BlWVHRa/dEG1qfSoqFU9+IQTd+U42mtvvH3oJHEXK7WXzborIXTQ/08Ztdvy
          -----END RSA PRIVATE KEY-----
        rabbitmq_ssl_cert: |
          -----BEGIN CERTIFICATE-----
          MIIGIDCCBAigAwIBAgIJAJznLlNteaZFMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          MfXPTUI+7+5WQLx10yavJ2gOhdyVuDVagfUM4epcriJbACuphDxHj45GINOGhaCd
          UVVCxqnB9qU16ea/kB3Yzsrus7egr9OienpDCFV2Q/kgUSc7
          -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.

    • If you replace the certificates, modify all three sections.

  4. Update the /openstack/message_queue.yml file by adding the newly created class to the RabbitMQ nodes:

    classes:
    - service.rabbitmq.server.ssl
    - cluster.<cluster_name>.openstack.ssl.rabbitmq
    
  5. Log in to the Salt Master node.

  6. Refresh pillars:

    salt -C 'I@rabbitmq:server' saltutil.refresh_pillar
    
  7. Apply the rabbitmq state to publish the new certificates:

    salt -C 'I@rabbitmq:server' state.sls rabbitmq -l debug
    
  8. Verify the new certificate validity dates:

    salt -C 'I@rabbitmq:server' cmd.run 'openssl x509 \
    -in /etc/rabbitmq/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Apr 27 12:37:14 2018 GMT
    Not After : Apr 27 12:37:14 2019 GMT
    Not Before: Apr 27 12:37:08 2018 GMT
    Not After : Apr 27 12:37:08 2019 GMT
    Not Before: Apr 27 12:37:13 2018 GMT
    Not After : Apr 27 12:37:13 2019 GMT
    
  9. Restart the RabbitMQ services one by one:

    salt -C 'I@rabbitmq:server' cmd.run 'service rabbitmq-server stop; \
    service rabbitmq-server start' -b1
    
  10. Verify the RabbitMQ cluster status:

    salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
    

    Example of system response:

    msg03.multinode-ha.int:
        Cluster status of node rabbit@msg03
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg01,rabbit@msg02,rabbit@msg03]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg01,[]},{rabbit@msg02,[]},{rabbit@msg03,[]}]}]
    msg01.multinode-ha.int:
        Cluster status of node rabbit@msg01
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg02,rabbit@msg01]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg02,[]},{rabbit@msg01,[]}]}]
    msg02.multinode-ha.int:
        Cluster status of node rabbit@msg02
        [{nodes,[{disc,[rabbit@msg01,rabbit@msg02,rabbit@msg03]}]},
         {running_nodes,[rabbit@msg03,rabbit@msg01,rabbit@msg02]},
         {cluster_name,<<"openstack">>},
         {partitions,[]},
         {alarms,[{rabbit@msg03,[]},{rabbit@msg01,[]},{rabbit@msg02,[]}]}]
    
  11. Restart all OpenStack API services and agents.
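
    The exact set of services depends on the components installed in your environment. The following commands are an illustrative sketch only; adjust the Salt targets and service names to your deployment:

    salt -C 'I@nova:controller' cmd.run 'service nova-api restart'
    salt -C 'I@glance:server' cmd.run 'service glance-api restart'
    salt -C 'I@cinder:controller' cmd.run 'service cinder-api restart'
    salt -C 'I@neutron:server' cmd.run 'service neutron-server restart'
    salt -C 'I@heat:server' cmd.run 'service heat-api restart'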

MySQL/Galera certificates

This section describes how to renew or replace the MySQL/Galera certificates, which can be either managed by salt-minion or self-managed through pillars.

Verify that the MySQL/Galera cluster uses certificates

This section describes how to determine whether your MySQL/Galera cluster uses certificates and identify their location on the system.

To verify that the MySQL/Galera cluster uses certificates:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@galera:master' mysql.showglobal | grep -EB3 '(have_ssl|ssl_(key|ca|cert))$'
    

    Example of system response:

    Value:
        YES
    Variable_name:
        have_ssl
    
    Value:
        /etc/mysql/ssl/ca.pem
    Variable_name:
        ssl_ca
    
    Value:
        /etc/mysql/ssl/cert.pem
    Variable_name:
        ssl_cert
    
    Value:
        /etc/mysql/ssl/key.pem
    Variable_name:
        ssl_key
    
  3. Proceed to renewal or replacement of your certificates as required.

Renew or replace the MySQL/Galera certificates managed by salt-minion

This section describes how to renew or replace the MySQL/Galera certificates managed by salt-minion.

Prerequisites:

  1. Log in to the Salt Master node.

  2. Verify that the MySQL/Galera cluster is up and synced:

    salt -C 'I@galera:master' mysql.status | grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
    

    Example of system response:

    wsrep_cluster_size:
        3
    
    wsrep_incoming_addresses:
        192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
    
    wsrep_local_state_comment:
        Synced
    
  3. Verify that the log files have no errors:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'cat /var/log/mysql/error.log |grep ERROR|wc -l'
    

    Example of system response:

    dbs01.multinode-ha.int
        0
    
    dbs02.multinode-ha.int
        0
    
    dbs03.multinode-ha.int
        0
    

    Any value except 0 in the output indicates that the log files include errors. Review them before proceeding to operations with MySQL/Galera.
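
    For example, to inspect the most recent errors before troubleshooting (an illustrative command, adjust the number of lines as needed):

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'grep ERROR /var/log/mysql/error.log | tail -n 20'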

  4. Verify that the ca-salt_master_ca certificate is available on all nodes with MySQL/Galera:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'ls /usr/local/share/ca-certificates/ca-salt_master_ca.crt'
    

    Example of system response:

    dbs01.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    
    dbs02.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    
    dbs03.multinode-ha.int
        /usr/local/share/ca-certificates/ca-salt_master_ca.crt
    

To renew or replace the MySQL/Galera certificates managed by salt-minion:

  1. Log in to the Salt Master node.

  2. Obtain the list of the Galera cluster minions:

    salt -C 'I@galera:master or I@galera:slave' pillar.get _nonexistent | cut -d':' -f1
    

    Example of system response:

    dbs02.multinode-ha.int
    dbs03.multinode-ha.int
    dbs01.multinode-ha.int
    
  3. Verify the certificate validity dates:

    salt -C 'I@galera:master' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    salt -C 'I@galera:slave' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    Not Before: May 30 17:25:24 2018 GMT
    Not After : May 30 17:25:24 2019 GMT
    Not Before: May 30 17:26:52 2018 GMT
    Not After : May 30 17:26:52 2019 GMT
    
  4. Prepare the Galera nodes to work with both the old and the new Salt Master CA certificates:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  5. Verify that the necessary files are present in the ssl directory:

    salt -C 'I@galera:master or I@galera:slave' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs01.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs02.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs03.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  6. Identify the minion IDs of the Galera nodes:

    • For the Galera master node:

      salt -C 'I@galera:master' test.ping --output yaml | cut -d':' -f1
      

      Example of system response:

      dbs01.multinode-ha.int
      
    • For the Galera slave nodes:

      salt -C 'I@galera:slave' test.ping --output yaml | cut -d':' -f1
      

      Example of system response:

      dbs02.multinode-ha.int
      dbs03.multinode-ha.int
      
  7. Restart the MySQL service for every Galera minion ID one by one. After each Galera minion restart, verify the Galera cluster size and status. Proceed to the next Galera minion restart only if the Galera cluster is synced.

    • To restart the MySQL service for a Galera minion:

      salt <minion_ID> service.stop mysql
      salt <minion_ID> service.start mysql
      
    • To verify the Galera cluster size and status:

      salt -C 'I@galera:master' mysql.status | grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
      

      Example of system response:

      wsrep_cluster_size:
          3
      
      wsrep_incoming_addresses:
          192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
      
      wsrep_local_state_comment:
          Synced
      
  8. If you replace the certificates, remove the private key:

    salt -C 'I@galera:master' cmd.run 'mv /etc/mysql/ssl/key.pem /root'
    
  9. Force the certificates regeneration for the Galera master node:

    salt -C 'I@galera:master' cmd.run 'mv /etc/mysql/ssl/cert.pem /root; mv /etc/mysql/ssl/ca.pem /root'
    
    salt -C 'I@galera:master' state.sls salt.minion.cert -l debug
    
    salt -C 'I@galera:master' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  10. Verify that the certificate validity dates have changed:

    salt -C 'I@galera:master' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    
  11. Verify that the necessary files are present in the ssl directory on the Galera master node:

    salt -C 'I@galera:master' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs01.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  12. Restart the MySQL service on the Galera master node:

    salt -C 'I@galera:master' service.stop mysql
    salt -C 'I@galera:master' service.start mysql
    
  13. Verify that the Galera cluster status is up. For details, see step 7.

  14. If you replace the certificates, remove the private key:

    salt -C 'I@galera:slave' cmd.run 'mv /etc/mysql/ssl/key.pem /root'
    
  15. Force the certificates regeneration for the Galera slave nodes:

    salt -C 'I@galera:slave' cmd.run 'mv /etc/mysql/ssl/cert.pem /root; mv /etc/mysql/ssl/ca.pem /root'
    
    salt -C 'I@galera:slave' state.sls salt.minion.cert -l debug
    
    salt -C 'I@galera:slave' cmd.run 'cat /usr/local/share/ca-certificates/ca-salt_master_ca.crt /usr/local/share/ca-certificates/ca-salt_master_ca_old.crt > /etc/mysql/ssl/ca.pem'
    
  16. Verify that the necessary files are present in the ssl directory on the Galera slave nodes:

    salt -C 'I@galera:slave' cmd.run 'ls /etc/mysql/ssl'
    

    Example of system response:

    dbs02.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
    dbs03.multinode-ha.int
        ca.pem
        cert.pem
        key.pem
    
  17. Verify that the certificate validity dates have changed:

    salt -C 'I@galera:slave' cmd.run 'openssl x509 -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    Not Before: Jun  4 16:14:31 2018 GMT
    Not After : Jun  4 16:14:31 2019 GMT
    
  18. Restart the MySQL service for every Galera slave minion ID one by one. After each Galera slave minion restart, verify the Galera cluster size and status. Proceed to the next Galera slave minion restart only if the Galera cluster is synced. For details, see step 7.

Renew or replace the self-managed MySQL/Galera certificates

This section describes how to renew or replace the self-managed MySQL/Galera certificates.

To renew or replace the self-managed MySQL/Galera certificates:

  1. Log in to the Salt Master node.

  2. Create the classes/cluster/<cluster_name>/openstack/ssl/galera_master.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      galera:
        master:
          ssl:
            enabled: True
            cacert_chain: ${_param:galera_ssl_cacert_chain}
            key: ${_param:galera_ssl_key}
            cert: ${_param:galera_ssl_cert}
            ca_file: ${_param:mysql_ssl_ca_file}
            key_file: ${_param:mysql_ssl_key_file}
            cert_file: ${_param:mysql_ssl_cert_file}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  3. Create the classes/cluster/<cluster_name>/openstack/ssl/galera_slave.yml file with the following configuration as an example:

    classes:
    - cluster.<cluster_name>.openstack.ssl
    parameters:
      galera:
        slave:
          ssl:
            enabled: True
            cacert_chain: ${_param:galera_ssl_cacert_chain}
            key: ${_param:galera_ssl_key}
            cert: ${_param:galera_ssl_cert}
            ca_file: ${_param:mysql_ssl_ca_file}
            key_file: ${_param:mysql_ssl_key_file}
            cert_file: ${_param:mysql_ssl_cert_file}
    

    Note

    Substitute <cluster_name> with the appropriate value.

  4. Create the classes/cluster/<cluster_name>/openstack/ssl/init.yml file with the following configuration as an example:

    parameters:
      _param:
        mysql_ssl_key_file: /etc/mysql/ssl/key.pem
        mysql_ssl_cert_file: /etc/mysql/ssl/cert.pem
        mysql_ssl_ca_file: /etc/mysql/ssl/ca.pem
        galera_ssl_cacert_chain: |
          -----BEGIN CERTIFICATE-----
          MIIF0TCCA7mgAwIBAgIJAOkTQnjLz6rEMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          RHXc4FoWv9/n8ZcfsqjQCjF3vUUZBB3zdlfLCLJRruB4xxYukc3gFpFLm21+0ih+
          M8IfJ5I=
          -----END CERTIFICATE-----
        galera_ssl_key: |
          -----BEGIN RSA PRIVATE KEY-----
          MIIJKQIBAAKCAgEArVSJ16ePjCik+6bZBzhiu3enXw8R9Ms1k4x57633IX1sEZTJ
          ...
          0VgM2bDSNyUuiwCbOMK0Kyn+wGeHF/jGSbVsxYI4OeLFz8gdVUqm7olJj4j3xemY
          BlWVHRa/dEG1qfSoqFU9+IQTd+U42mtvvH3oJHEXK7WXzborIXTQ/08Ztdvy
          -----END RSA PRIVATE KEY-----
        galera_ssl_cert: |
          -----BEGIN CERTIFICATE-----
          MIIGIDCCBAigAwIBAgIJAJznLlNteaZFMA0GCSqGSIb3DQEBCwUAMEoxCzAJBgNV
          ...
          MfXPTUI+7+5WQLx10yavJ2gOhdyVuDVagfUM4epcriJbACuphDxHj45GINOGhaCd
          UVVCxqnB9qU16ea/kB3Yzsrus7egr9OienpDCFV2Q/kgUSc7
          -----END CERTIFICATE-----
    

    Note

    Modify the example above by adding your certificates and key:

    • If you renew the certificates, leave your existing key and update the cert and chain sections.

    • If you replace the certificates, modify all three sections.

  5. Update the classes/cluster/<cluster_name>/infra/config.yml file by adding the newly created classes to the database nodes:

    openstack_database_node01:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node01_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.database_init
      - cluster.${_param:cluster_name}.openstack.ssl.galera_master
    openstack_database_node02:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node02_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.ssl.galera_slave
    openstack_database_node03:
      params:
        linux_system_codename: xenial
        deploy_address: ${_param:openstack_database_node03_deploy_address}
      classes:
      - cluster.${_param:cluster_name}.openstack.ssl.galera_slave
    
  6. Regenerate the Reclass storage:

    salt-call state.sls reclass.storage -l debug
    
  7. Refresh pillars:

    salt -C 'I@galera:master or I@galera:slave' saltutil.refresh_pillar
    
  8. Verify the certificate validity dates:

    salt -C 'I@galera:master' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    salt -C 'I@galera:slave' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: May 30 17:21:10 2018 GMT
    Not After : May 30 17:21:10 2019 GMT
    Not Before: May 30 17:25:24 2018 GMT
    Not After : May 30 17:25:24 2019 GMT
    Not Before: May 30 17:26:52 2018 GMT
    Not After : May 30 17:26:52 2019 GMT
    
  9. Force the certificate regeneration on the Galera master node:

    salt -C 'I@galera:master' state.sls galera -l debug
    
  10. Verify the new certificate validity dates on the Galera master node:

    salt -C 'I@galera:master' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    
  11. Restart the MySQL service on the Galera master node:

    salt -C 'I@galera:master' service.stop mysql
    salt -C 'I@galera:master' service.start mysql
    
  12. Verify that the Galera cluster status is up:

    salt -C 'I@galera:master' mysql.status | \
    grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
    

    Example of system response:

    wsrep_cluster_size:
        3
    
    wsrep_incoming_addresses:
        192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
    
    wsrep_local_state_comment:
        Synced
    
  13. Force the certificate regeneration on the Galera slave nodes:

    salt -C 'I@galera:slave' state.sls galera -l debug
    
  14. Verify that the certificate validity dates have changed:

    salt -C 'I@galera:slave' cmd.run 'openssl x509 \
    -in /etc/mysql/ssl/cert.pem -text -noout' | grep -Ei 'after|before'
    

    Example of system response:

    Not Before: Jun  4 16:14:24 2018 GMT
    Not After : Jun  4 16:14:24 2019 GMT
    Not Before: Jun  4 16:14:31 2018 GMT
    Not After : Jun  4 16:14:31 2019 GMT
    
  15. Obtain the minion IDs of the Galera slave nodes:

    salt -C 'I@galera:slave' test.ping --output yaml | cut -d':' -f1
    

    Example of system response:

    dbs02.multinode-ha.int
    dbs03.multinode-ha.int
    
  16. Restart the MySQL service for every Galera slave minion ID one by one. After each Galera slave minion restart, verify the Galera cluster size and status. Proceed to the next Galera slave minion restart only if the Galera cluster is synced.

    • To restart the MySQL service for a Galera slave minion:

      salt <minion_ID> service.stop mysql
      salt <minion_ID> service.start mysql
      
    • To verify the Galera cluster size and status:

      salt -C 'I@galera:master' mysql.status | \
      grep -EA1 'wsrep_(local_state_c|incoming_a|cluster_size)'
      

      Example of system response:

      wsrep_cluster_size:
          3
      
      wsrep_incoming_addresses:
          192.168.2.52:3306,192.168.2.53:3306,192.168.2.51:3306
      
      wsrep_local_state_comment:
          Synced
      

Change the certificate validity period

You can change the certificate validity period by managing the validity period of the signing policy, which is used for certificate generation and is set to 365 days by default.

Note

The procedure does not update the CA certificates and does not change the signing policy itself.

To change the certificate validity period:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/infra/config/init.yml, specify the following pillar:

    parameters:
      _param:
        salt_minion_ca_days_valid_certificate: <required_value>
    
  3. Apply the changes:

    salt '*' saltutil.sync_all
    salt -C 'I@salt:master' state.sls salt.minion.ca
    salt -C 'I@salt:master' state.sls salt.minion
    
  4. Remove the certificate you need to update.
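
    For example, for a certificate generated by the salt.minion.cert state, remove the certificate file on the target node. The path and file name below are illustrative and depend on your configuration:

    salt -C '<target_node>' cmd.run 'rm -f /etc/ssl/certs/<certificate_name>.crt'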

  5. Apply the following state:

    salt -C '<target_node>' state.sls salt.minion.cert
    
  6. Verify the end date of the updated certificate:

    salt -C '<target_node>' cmd.run 'openssl x509 -enddate -noout -in <path_to_cert>'
    

Enable FQDN on internal endpoints in the Keystone catalog

In new MCP 2019.2.3 deployments, the OpenStack environments use FQDNs on the internal endpoints in the Keystone catalog by default.

In existing MCP deployments, IP addresses are used on the internal Keystone endpoints. This section instructs you on how to enable FQDNs on the internal endpoints for existing MCP deployments updated to MCP version 2019.2.3 or newer.

To enable FQDN on the Keystone internal endpoints:

  1. Verify that you have updated MCP DriveTrain to the 2019.2.3 or newer version as described in Update DriveTrain.

  2. Log in to the Salt Master node.

  3. On the system Reclass level:

    1. Verify that there are classes present under the /srv/salt/reclass/classes/system/linux/network/hosts/openstack directory.
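
      For example, you can list the directory content from the Salt Master node:

      ls /srv/salt/reclass/classes/system/linux/network/hosts/openstack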

    2. Verify that the following parameters are set in defaults/openstack/init.yml as follows:

      parameters:
        _param:
          openstack_service_hostname: os-ctl-vip
          openstack_service_host: ${_param:openstack_service_hostname}.${linux:system:domain}
      
    3. If you have the extra OpenStack services installed, define the additional parameters in defaults/openstack/init.yml as required:

      • For Manila:

        parameters:
          _param:
            openstack_share_service_hostname: os-share-vip
            openstack_share_service_host: ${_param:openstack_share_service_hostname}.${linux:system:domain}
        
      • For Barbican:

        parameters:
          _param:
            openstack_kmn_service_hostname: os-kmn-vip
            openstack_kmn_service_host: ${_param:openstack_kmn_service_hostname}.${linux:system:domain}
        
      • For Tenant Telemetry:

        parameters:
          _param:
            openstack_telemetry_service_hostname: os-telemetry-vip
            openstack_telemetry_service_host: ${_param:openstack_telemetry_service_hostname}.${linux:system:domain}
        
  4. On the cluster Reclass level, configure the FQDN on internal endpoints by editing infra/init.yml:

    1. Add the following class for the core OpenStack services:

      classes:
        - system.linux.network.hosts.openstack
      
    2. If you have the extra OpenStack services installed, define the additional classes as required:

      • For Manila:

        classes:
          - system.linux.network.hosts.openstack.share
        
      • For Barbican:

        classes:
          - system.linux.network.hosts.openstack.kmn
        
      • For Tenant Telemetry:

        classes:
          - system.linux.network.hosts.openstack.telemetry
        
  5. On the cluster Reclass level, define the following parameters in the openstack/init.yml file:

    1. Define the following parameters for the core OpenStack services:

      parameters:
        _param:
          glance_service_host: ${_param:openstack_service_host}
          keystone_service_host: ${_param:openstack_service_host}
          heat_service_host: ${_param:openstack_service_host}
          cinder_service_host: ${_param:openstack_service_host}
          nova_service_host: ${_param:openstack_service_host}
          placement_service_host: ${_param:openstack_service_host}
          neutron_service_host: ${_param:openstack_service_host}
      
    2. If you have the extra services installed, define the following parameters as required:

      • For Tenant Telemetry:

        parameters:
          _param:
            aodh_service_host: ${_param:openstack_telemetry_service_host}
            ceilometer_service_host: ${_param:openstack_telemetry_service_host}
            panko_service_host: ${_param:openstack_telemetry_service_host}
            gnocchi_service_host: ${_param:openstack_telemetry_service_host}
        
      • For Manila:

        parameters:
          _param:
            manila_service_host: ${_param:openstack_share_service_host}
        
      • For Designate:

        parameters:
          _param:
            designate_service_host: ${_param:openstack_service_host}
        
      • For Barbican:

        parameters:
          _param:
            barbican_service_host: ${_param:openstack_kmn_service_host}
        
  6. Apply the keystone state:

    salt -C 'I@keystone:server' state.apply keystone
    
  7. Log in to one of the OpenStack controller nodes.

  8. Verify that the changes have been applied successfully:

    openstack endpoint list
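
    To narrow the output down to the internal endpoints only, you can filter by interface, for example:

    openstack endpoint list --interface internal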
    
  9. If SSL is used on the Keystone internal endpoints:

    1. If Manila or Telemetry is installed:

      1. Log in to the Salt Master node.

      2. Open the Reclass cluster level of your deployment.

      3. For Manila, edit /openstack/share.yml. For example:

        parameters:
          _param:
            openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_local_address},IP:${_param:cluster_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_vip_address},DNS:${_param:openstack_share_service_host}
        
      4. For Tenant Telemetry, edit /openstack/telemetry.yml. For example:

        parameters:
          _param:
            openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_local_address},IP:${_param:cluster_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_vip_address},DNS:${_param:openstack_telemetry_service_host}
        
    2. Renew the OpenStack API certificates to include FQDN in CommonName (CN) as described in Manage certificates.

Enable Keystone security compliance policies

In the MCP OpenStack deployments, you can enable additional Keystone security compliance features independently of each other based on your corporate security policy. All available features apply only to the SQL back end for the Identity driver. By default, all security compliance features are disabled.

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section instructs you on how to enable the Keystone security compliance features on an existing MCP OpenStack deployment. For the new deployments, you can configure the compliance features during the Reclass deployment model creation through Model Designer.

Keystone security compliance parameters

The following list describes the available operations. For each operation, the first parameter enables the feature in Keystone for all SQL back-end users; the second parameter, where available, overrides the setting for a specific user.

  • Force the user to change the password upon the first use

    • All SQL back-end users: change_password_upon_first_use: True

      Forces the user to change their password upon the first use.

    • Override for specific users: ignore_change_password_upon_first_use: True

  • Configure password expiration

    • All SQL back-end users: password_expires_days: <NUM>

      Sets the number of days after which the password expires.

    • Override for specific users: ignore_password_expiry: True

  • Set an account lockout threshold

    • All SQL back-end users: lockout_failure_attempts: <NUM>

      Sets the maximum number of failed authentication attempts.

      lockout_duration: <NUM>

      Sets the lockout duration in seconds after the maximum number of failed authentication attempts is exceeded.

    • Override for specific users: ignore_lockout_failure_attempts: True

  • Restrict the user from changing their password

    • All SQL back-end users: N/A

    • Override for specific users: lock_password: True

  • Configure password strength requirements

    • All SQL back-end users: password_regex: <STRING> [1]

      Sets the strength requirements for the passwords.

      password_regex_description: <STRING>

      Provides the text that describes the password strength requirements. Required if password_regex is set.

    • Override for specific users: N/A

  • Disable inactive users

    • All SQL back-end users: disable_user_account_days_inactive: <NUM> [2]

      Sets the number of days of inactivity after which the user is disabled.

    • Override for specific users: N/A

  • Configure a unique password history

    • All SQL back-end users: unique_last_password_count: <NUM>

      Sets the number of passwords for a user that must be unique before an old password can be reused.

      minimum_password_age: <NUM>

      Sets the number of days for which a password must be used before the user can change it.

    • Override for specific users: N/A

Warning

[1]

When enabled, this option may affect all operations with Heat. Heat creates its service users with its own generated passwords, which are 32 characters long and contain uppercase and lowercase letters, digits, and special characters such as !, @, #, %, ^, &, and *. Therefore, to avoid affecting the Heat operations, verify that your custom value for this option allows such generated passwords. Currently, you cannot override the password regex enforcement in Keystone for a specific user.

[2]

When enabled, this option may affect autoscaling and other Heat operations that require deferred authentication. If the Heat service user created during the deployment remains inactive for the defined period, it is disabled and can no longer authenticate, so such operations fail the first time they are attempted after that period. Currently, you cannot override this parameter in Keystone for a specific user.

To enable the security compliance policies:

  1. Log in to the Salt Master node.

  2. Open your Git project repository with the Reclass model on the cluster level.

  3. Open the openstack/control/init.yml file for editing.

  4. Configure the security compliance policies for the OpenStack service users as required.

    • For all OpenStack service users. For example:

      parameters:
        _param:
          openstack_service_user_options:
            ignore_change_password_upon_first_use: True
            ignore_password_expiry: True
            ignore_lockout_failure_attempts: False
            lock_password: False
      
    • For specific service users on OpenStack Queens and newer releases. For example:

      keystone:
          client:
            resources:
              v3:
                users:
                  cinder:
                    options:
                      ignore_change_password_upon_first_use: True
                      ignore_password_expiry: False
                      ignore_lockout_failure_attempts: False
                      lock_password: True
      
    • For specific service users on OpenStack Pike and older releases. For example:

      keystone:
         client:
           server:
             identity:
               project:
                 service:
                   user:
                     cinder:
                       options:
                         ignore_change_password_upon_first_use: True
                         ignore_password_expiry: False
                         ignore_lockout_failure_attempts: False
                         lock_password: True
      
  5. Enable the security compliance features on the Keystone server side by defining the related Keystone server parameters as required.

    Example configuration:

    keystone:
      server:
        security_compliance:
          disable_user_account_days_inactive: 90
          lockout_failure_attempts: 5
          lockout_duration: 600
          password_expires_days: 90
          unique_last_password_count: 10
          minimum_password_age: 0
          password_regex: '^(?=.*\d)(?=.*[a-zA-Z]).{7,}$$'
          password_regex_description: 'Your password must contain at least 1 letter, 1 digit, and have a minimum length of 7 characters'
          change_password_upon_first_use: true
    
  6. Apply the changes:

    salt -C 'I@keystone:client' state.sls keystone.client
    salt -C 'I@keystone:server' state.sls keystone.server
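
    Optionally, verify that the options were rendered into the Keystone configuration on the controller nodes. This is an illustrative check; the section content depends on the options you enabled:

    salt -C 'I@keystone:server' cmd.run 'grep -A 10 "^\[security_compliance\]" /etc/keystone/keystone.conf'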
    

Restrict the VM image policy

This section instructs you on how to restrict Glance, Nova, and Cinder snapshot policy to only allow Administrators to manage images and snapshots in your OpenStack environment.

To configure Administrator only policy:

  1. In the /etc/nova directory, create and edit the policy.json for Nova as follows:

    {
        "os_compute_api:servers:create_image": "rule:admin_api",
        "os_compute_api:servers:create_image:allow_volume_backed": "rule:admin_api",
    }
    
  2. In the openstack/control.yml file, restrict managing operations by setting the role:admin value for the following parameters for Glance and Cinder:

    parameters:
      glance:
        server:
          policy:
            add_image: "role:admin"
            delete_image: "role:admin"
            modify_image: "role:admin"
            publicize_image: "role:admin"
            copy_from: "role:admin"
            upload_image: "role:admin"
            delete_image_location: "role:admin"
            set_image_location: "role:admin"
            deactivate: "role:admin"
            reactivate: "role:admin"
      cinder:
        server:
          policy:
            'volume_extension:volume_actions:upload_image': "role:admin"
    
  3. Apply the following states:

    salt 'ctl*' state.sls glance.server,cinder.controller
    
  4. Verify that the rules have changed in the states output.

  5. If the Comment: State 'keystone_policy.rule_present' was not found in SLS 'glance.server' error occurs, synchronize Salt modules and re-apply the glance.server state:

    salt 'ctl*' saltutil.sync_all
    salt 'ctl*' state.sls glance.server
    
  6. To apply the changes, restart the glance-api service:

    salt 'ctl*' service.restart glance-api
    

Configure Neutron OVS

After deploying an OpenStack environment with Neutron Open vSwitch, you may want to enable some of the additional features and configurations that Neutron provides.

This section describes how to enable and operate supported Neutron features.

Configure Neutron Quality of Service (QoS)

Neutron Quality of Service, or QoS, is a Neutron feature that enables OpenStack administrators to limit and prioritize network traffic through a set of policies for better network bandwidth.

MCP supports QoS policies with the following limitations:

  • Bandwidth limit for SR-IOV must be specified in Megabits per second (Mbps) and be divisible into whole 1000 Kilobits per second (Kbps) chunks.

    All values lower than 1000 Kbps are rounded up to 1 Mbps. Since float values are not supported, all values that cannot be divided into whole 1000 Kbps chunks are rounded up to the nearest integer Mbps value. For example, a requested limit of 2500 Kbps is rounded up to 3 Mbps.

  • QoS rules are supported for the egress traffic only.

  • The network interface driver must support minimum transmit bandwidth (min_tx_rate).

    Minimum transmit bandwidth is supported by such drivers as QLogic 10 Gigabit Ethernet Driver (qlcnic), BNXT Poll Mode driver (bnxt), and so on. The Intel Linux ixgbe and i40e drivers do not support setting minimum transmit bandwidth.

  • No automatic oversubscription protection.

    Since the minimum transmit bandwidth is supported on the hypervisor level, your network is not protected from oversubscription. Total bandwidth on all ports may exceed maximum available bandwidth in the provider’s network.

This section describes how to configure Neutron Quality of Service.

Enable Neutron Quality of Service

By default, Neutron QoS is disabled. You can enable Neutron QoS before or after deploying an OpenStack environment.

To enable Neutron Quality of Service:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster-name>.openstack.init.yml file for editing.

  3. Set the neutron_enable_qos parameter to True:

    parameters:
      _param:
        neutron_enable_qos: True
        ...
    

    This enables the QoS functionality. Depending on the deployment, it loads the qos extension for the openvswitch and/or sriovnicswitch agents.

  4. Re-run Salt configuration on the Salt Master node:

    salt -C 'I@neutron:server' state.sls neutron
    salt -C 'I@neutron:gateway' state.sls neutron
    salt -C 'I@neutron:compute' state.sls neutron
    
  5. Proceed to Create a QoS policy.
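
Before you proceed, you can optionally verify that the qos extension is now available in Neutron. A minimal check, assuming the admin credentials are sourced on a controller node:

  openstack extension list --network | grep -i qos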

Create a QoS policy

After you enable the Neutron Quality of Service feature, configure a QoS policy to prioritize one type of traffic over another. This section describes basic operations. For more information, see the OpenStack documentation.

To create a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Create a QoS policy:

    neutron qos-policy-create bw-limiter
    
  3. Add a rule to the QoS policy:

    neutron qos-bandwidth-limit-rule-create bw-limiter --max-kbps 3000000
    
  4. Apply the QoS policy:

    • To a new network:

      neutron net-create <network-name> --qos-policy bw-limiter
      
    • To an existing network:

      neutron net-update test --qos-policy bw-limiter
      
    • To a new port:

      neutron port-create test --name sriov_port --binding:vnic_type direct \
      --qos-policy bw-limiter
      
    • To an existing port:

      neutron port-update sriov_port --qos-policy bw-limiter
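
To verify the resulting configuration, you can display the policy, its rules, and the association on a port. A minimal check that reuses the names from the examples above:

  neutron qos-policy-show bw-limiter
  neutron qos-bandwidth-limit-rule-list bw-limiter
  neutron port-show sriov_port | grep qos_policy_id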
      
Apply changes to a QoS policy

You can update or remove an existing QoS policy.

To update a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Update the QoS policy rule using the neutron qos-bandwidth-limit-rule-update <rule-id> <qos-policy-name> command.

    Example:

    rule_id=`neutron qos-bandwidth-limit-rule-list bw-limiter -f value -c id`
    neutron qos-bandwidth-limit-rule-update bw-limiter $rule_id --max-kbps 200000
    

To remove a QoS policy:

  1. Log in to an OpenStack controller node.

  2. Remove the QoS policy from:

    • A network:

      neutron net-update <network-name> --no-qos-policy
      
    • A port:

      neutron port-update <sriov_port> --no-qos-policy
      

Enable network trunking

The Mirantis Cloud Platform supports port trunking, which enables you to attach a virtual machine to multiple Neutron networks by using VLANs as a local encapsulation to differentiate the traffic of each network as it enters and leaves a single virtual machine network interface (VIF).

Using network trunking is particularly beneficial in the following use cases:

  • Some applications require connection to hundreds of Neutron networks. To achieve this, you may want to use a single or a few VIFs and VLANs to differentiate traffic for each network rather than having hundreds of VIFs per VM.

  • Cloud workloads are often very dynamic. You may prefer to add or remove VLANs rather than to hotplug interfaces in a virtual machine.

  • Moving a virtual machine from one network to another without detaching the VIF from the virtual machine.

  • A virtual machine may run many containers. Each container may have requirements to be connected to different Neutron networks. Assigning a VLAN or other encapsulation ID for each container is more efficient and scalable than requiring a vNIC per container.

  • Some legacy applications require VLANs to connect to multiple networks.

Currently, MCP supports network trunking only for Neutron OVS with DPDK and the Open vSwitch firewall driver enabled. Other Neutron ML2 plugins, such as Linux Bridge and OVN, are not supported. If you use security groups together with network trunking, MCP automatically enables the native Open vSwitch firewall driver.

To enable network trunking:

  1. Log in to the Salt Master node.

  2. Open the cluster.<NAME>.openstack.init.yml file for editing.

  3. Set the neutron_enable_vlan_aware_vms parameter to True:

    parameters:
      _param:
        neutron_enable_vlan_aware_vms: True
        ...
    
  4. Re-run Salt configuration:

    salt -C 'I@neutron:server' state.sls neutron
    salt -C 'I@neutron:gateway' state.sls neutron
    salt -C 'I@neutron:compute' state.sls neutron
    

Enable L2 Gateway support

The L2 Gateway (L2GW) plugin for the Neutron service provides the ability to interconnect a given tenant network with a VLAN on a physical switch. The basic components of L2GW include:

  • L2GW Service plugin

    Residing on a controller node, the L2GW Service plugin notifies the L2GW agent and normal L2 OVS agents running on compute hosts about network events and distributes the VTEP IP address information between them.

  • L2GW agent

    Running on a network node, the L2GW agent is responsible for connecting to OVSDB server running on a hardware switch and updating the database based on instructions received from the L2GW service plugin.

Before you proceed with the L2GW enablement, verify that the following requirements are met:

  • OVSDB Hardware VTEP physical switch enabled

  • L2 population mechanism driver enabled

To enable L2GW support:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<cluster_name>/openstack/control.yml file of your Reclass model, configure the OpenStack controller nodes by including the service.neutron.control.services.l2gw class.

  3. In the classes/cluster/<cluster_name>/openstack/gateway.yml file of your Reclass model, add the Neutron L2GW agent configuration. For example:

    neutron:
      gateway:
        l2gw:
          enabled: true
          debug: true
          ovsdb_hosts:
            ovsdb1: 10.164.5.253:6622
            ovsdb2: 10.164.5.254:6622
    

    Note

    ovsdb{1,2}

    A user-defined identifier of a physical switch. This name is used in the OpenStack database to identify the switch.

  4. Apply the neutron state to the server nodes to install the service plugin packages, enable the L2GW service plugin, and update the Neutron database with the new schema:

    salt -I 'neutron:server' state.sls neutron -b 1
    
  5. Apply the neutron state to the gateway nodes to install the L2GW agent packages and configure the OVSDB parameters that include a switch pointer with the IP address and port:

    salt -I 'neutron:gateway' state.sls neutron
    
  6. Verify that the L2GW Neutron service plugin is enabled in your deployment:

    1. Log in to one of the OpenStack controller nodes.

    2. Verify that the following command is executed without errors:

      neutron l2-gateway-list
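
Once the service plugin and agent are running, you can bind a tenant network to a VLAN on the physical switch. A usage sketch with illustrative names; the device name must match the physical switch name registered in its OVSDB hardware VTEP database, and the interface and VLAN values depend on your hardware:

  neutron l2-gateway-create --device name=<SWITCH_NAME>,interface_names=<INTERFACE> l2gw-01
  neutron l2-gateway-connection-create l2gw-01 <NETWORK_NAME> \
    --default-segmentation-id <VLAN_ID>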
      

Configure BGP VPN

The Mirantis Cloud Platform (MCP) supports the Neutron Border Gateway Protocol (BGP) VPN Interconnection service. The BGP-based IP VPNs are commonly used in the industry, mainly for enterprises.

You can use the BGP VPN Interconnection service in the following typical use case: a tenant has a BGP IP VPN (a set of external sites) already set up outside the data center and wants to be able to trigger the establishment of a connection between VMs and these VPN external sites.

Enable the BGP VPN Interconnection service

If you have an existing BGP IP VPN (a set of external sites) set up outside the data center, you can enable the BGP VPN Interconnection service in MCP to be able to trigger the establishment of connectivity between VMs and these VPN external sites.

The drivers for the BGP VPN Interconnection service include:

  • OVS/BaGPipe driver

  • OpenContrail driver

  • OpenDaylight driver

To enable the BGP VPN Interconnection service:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster_name>.openstack.init.yml file for editing.

  3. Set the neutron_enable_bgp_vpn parameter to True.

  4. Set the driver neutron_bgp_vpn_driver parameter to one of the following values: bagpipe, opendaylight, opencontrail. For example:

    parameters:
      _param:
        neutron_enable_bgp_vpn: True
        neutron_bgp_vpn_driver: bagpipe
        ...
    
  5. Re-apply the Salt configuration:

    salt -C 'I@neutron:server' state.sls neutron
    

For the OpenContrail and OpenDaylight drivers, we assume that the related SDN controllers are already enabled in your MCP cluster. To configure the BaGPipe driver, see Configure the BaGPipe driver for BGP VPN.
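
Once the service is enabled, BGP VPNs are managed through the Neutron bgpvpn API extension. A minimal usage sketch with illustrative names and route targets; a BGP VPN is typically created by an administrator and then associated with a tenant network or router:

  neutron bgpvpn-create --route-targets 64512:100 --name external-vpn
  neutron bgpvpn-net-assoc-create external-vpn --network <NETWORK_NAME>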

Configure the BaGPipe driver for BGP VPN

The BaGPipe driver is a lightweight implementation of the BGP-based VPNs used as a reference back end for the Neutron BGP VPN Interconnection service.

For the instruction below, we assume that the Neutron BGP VPN Interconnection service is already enabled on the OpenStack controller nodes. To enable BGP VPN, see Enable the BGP VPN Interconnection service.

To configure the BaGPipe driver:

  1. Log in to the Salt Master node.

  2. Open the cluster.<cluster_name>.openstack.compute.yml file for editing.

  3. Add the following parameters:

    parameters:
      ...
      neutron:
        compute:
          bgp_vpn:
            enabled: True
            driver: bagpipe
            bagpipe:
              local_address: <IP address used for BGP peerings>
              peers: <IP addresses of BGP peers>
              autonomous_system: <BGP Autonomous System Number>
              enable_rtc: True # Enable RT Constraint (RFC4684)
          backend:
            extension:
              bagpipe_bgpvpn:
                enabled: True
    
  4. Re-apply the Salt configuration:

    salt -C 'I@neutron:compute' state.sls neutron
    

Note

If BaGPipe is to be enabled on several compute nodes, set up Route Reflector to interconnect those BagPipe instances. For more information, see BGP and Route Reflection.
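
Before troubleshooting the agents themselves, you can verify that the BaGPipe configuration reached the compute nodes. A minimal check from the Salt Master node:

  salt -C 'I@neutron:compute' pillar.get neutron:compute:bgp_vpn
  salt -C 'I@neutron:compute' pillar.get neutron:compute:backend:extension:bagpipe_bgpvpn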

Enable the Networking NW-ODL ML2 plugin

This section explains how to enable the Networking OpenDaylight (NW-ODL) Modular Layer 2 (ML2) plugin for Neutron in your deployment using the Neutron Salt formula, which installs the networking-odl package and enables Neutron to connect to the OpenDaylight controller.

Note

The procedure assumes that the OpenDaylight controller is already up and running.

To enable the NW-ODL ML2 plugin:

  1. Log in to the Salt Master node.

  2. Define the OpenDaylight plugin options in the cluster/<cluster_name>/openstack/init.yml file as follows:

    _param:
      opendaylight_service_host: <ODL_controller_IP>
      opendaylight_router: odl-router_v2 # default
      opendaylight_driver: opendaylight_v2 # default
      provider_mappings: physnet1:br-floating # default
    
  3. In the cluster/<cluster_name>/openstack/control.yml file of your Reclass model, configure the Neutron server by including the system.neutron.control.opendaylight.cluster class and setting the credentials and port of the OpenDaylight REST API. For example:

    classes:
    - system.neutron.control.opendaylight.cluster
    parameters:
      neutron:
        server:
          backend:
            rest_api_port: 8282
            user: admin
            password: admin
    
  4. In the cluster/<cluster_name>/openstack/gateway.yml file of your Reclass model, include the following class:

    classes:
    - service.neutron.gateway.opendaylight.single
    
  5. In the classes/cluster/<cluster_name>/openstack/compute.yml file of your Reclass model, include the following class:

    classes:
    - service.neutron.compute.opendaylight.single
    
  6. Apply the configuration changes by executing the neutron state on all nodes with neutron:server, neutron:gateway, and neutron:compute roles:

    salt -I 'neutron:server' state.sls neutron
    salt -I 'neutron:gateway' state.sls neutron
    salt -I 'neutron:compute' state.sls neutron
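
To verify that the Neutron server picked up the OpenDaylight configuration, you can inspect the rendered ML2 configuration. A minimal check, assuming the default Neutron configuration path; the output should list the opendaylight_v2 mechanism driver:

  salt -I 'neutron:server' cmd.run "grep mechanism_drivers /etc/neutron/plugins/ml2/ml2_conf.ini"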
    

Enable Cross-AZ high availability for Neutron agents

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

The Mirantis Cloud Platform (MCP) supports HA with availability zones for Neutron. An availability zone is defined as an agent attribute on a network node and groups network nodes that run services like DHCP, L3, and others. You can associate an availability zone with resources to ensure the resources become HA.

Availability zones provide an extra layer of protection by segmenting the Neutron service deployment in isolated failure domains. By deploying HA nodes across different availability zones, the network services remain available in case of zone-wide failures that affect the deployment. For details, see OpenStack documentation.

Enable Cross-AZ high availability for DHCP

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable Cross-AZ high availability for DHCP. As a result, DHCP services will be created in availability zones selected during the network creation.

To enable Cross-AZ high availability for DHCP:

  1. Log in to the Salt Master node.

  2. In /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/control.yml, set the following parameters:

    parameters:
      neutron:
        server:
          dhcp_agents_per_network: '2'
          dhcp_load_type: 'networks'
          network_scheduler_driver: neutron.scheduler.dhcp_agent_scheduler.AZAwareWeightScheduler
    
  3. In /srv/salt/reclass/classes/cluster/<cluster_name>/infra/config/nodes.yml, set the availability_zone parameter for each network or gateway node as required:

    parameters:
      reclass:
        storage:
          node:
            ...
            openstack_gateway_node<id>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
            openstack_gateway_node<id+1>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
           ...
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.sls reclass.storage
    salt -C 'I@neutron:server or I@neutron:gateway' saltutil.refresh_pillar
    salt -C 'I@neutron:server or I@neutron:gateway' state.sls neutron
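
After the changes are applied, you can request availability zones when creating networks. A minimal usage sketch with illustrative zone and network names:

  openstack network create --availability-zone-hint az1 --availability-zone-hint az2 az-aware-net
  openstack network show az-aware-net -c availability_zone_hints -c availability_zones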
    
Enable Cross-AZ high availability for L3 router

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable Cross-AZ high availability for an L3 router. As a result, the L3 router services will be created in availability zones selected during the router creation.

To enable Cross-AZ high availability for L3 routers:

  1. Log in to the Salt Master node.

  2. In /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/control.yml, set the following parameters:

    parameters:
      neutron:
        server:
          router_scheduler_driver: neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler
          max_l3_agents_per_router: '3'
    
  3. In /srv/salt/reclass/classes/cluster/<cluster_name>/infra/config/nodes.yml, set the availability_zone parameter for each network or gateway node as required:

    parameters:
      reclass:
        storage:
          node:
            ...
            openstack_gateway_node<id>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
            openstack_gateway_node<id+1>:
              parameters:
                neutron:
                  gateway:
                    availability_zone: <az-name>
           ...
    
  4. Apply the changes:

    salt -C 'I@salt:master' state.sls reclass.storage
    salt -C 'I@neutron:server or I@neutron:gateway' saltutil.refresh_pillar
    salt -C 'I@neutron:server or I@neutron:gateway' state.sls neutron
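
After the changes are applied, you can request availability zones when creating routers. A minimal usage sketch with illustrative zone and router names:

  openstack router create --availability-zone-hint az1 --availability-zone-hint az2 az-aware-router
  openstack router show az-aware-router -c availability_zone_hints -c availability_zones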
    

Ironic operations

Ironic is an administrators-only service that allows access to all API requests only to the OpenStack users with the admin or baremetal_admin roles. However, some read-only operations are also available to the users with the baremetal_observer role.

In MCP, Ironic has not been integrated with the OpenStack Dashboard service yet. To manage and use Ironic, perform any required actions through the Bare Metal service command-line client using the ironic or openstack baremetal commands, from scripts using the ironicclient Python API, or through direct REST API interactions.

Managing and using Ironic includes creating suitable images, enrolling bare metal nodes into Ironic and configuring them appropriately, and adding compute flavors that correspond to the available bare metal nodes.
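
For example, the following read-only commands, run on a node with the Bare Metal client installed and admin credentials sourced, list the enabled drivers and the enrolled nodes and are a quick way to confirm that the Bare Metal API is reachable:

  openstack baremetal driver list
  openstack baremetal node list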

Prepare images for Ironic

To provision bare metal servers using Ironic, you need to create special images and upload them to Glance.

The configuration of these images depends largely on the actual hardware. Therefore, they cannot be provided as pre-built images, and you must prepare them after you deploy Ironic.

These images include:

  • Deploy image that runs the ironic-python-agent required for the deployment and control of bare metal nodes

  • User image based on the hardware used in your non-virtualized environment

Note

This section explains how to create the required images using the diskimage-builder tool.

Prepare deploy images

A deploy image is the image that the bare metal node is PXE-booted into during the image provisioning or node cleaning. It resides in the node’s RAM and has a special agent running that the ironic-conductor service communicates with to orchestrate the image provisioning and node cleaning.

Such images must contain drivers for all network interfaces and disks of the bare metal server.

Note

This section provides example instructions on how to prepare the required images using the diskimage-builder tool. The steps may differ depending on your specific needs and the builder tool. For more information, see Building or downloading a deploy ramdisk image.

To prepare deploy images:

  1. Create the required image by typing:

    diskimage-create <BASE-OS> ironic-agent
    
  2. Upload the resulting *.kernel and *.initramfs images to Glance as aki and ari images:

    1. To upload an aki image, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format aki \
           --container-format aki \
           --file <PATH_TO_IMAGE_KERNEL>
      
    2. To upload an ari image, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format ari \
           --container-format ari \
           --file <PATH_TO_IMAGE_INITRAMFS>
      
Prepare user images

Ironic supports two types of user images:

  • Whole disk image

    Image of complete operating system with the partition table and partitions

  • Partition image

    Image of root partition only, without the partition table. Such images must have appropriate kernel and initramfs images associated with them.

    The partition images can be deployed using one of the following methods:

    • netboot (default)

      The node is PXE-booted over the network, fetching the kernel and ramdisk over TFTP.

    • local boot

      During deployment, the image is modified on disk so that the node boots from the local disk. See Ironic Advanced features for details.

User images are deployed in a non-virtualized environment on real hardware servers. Therefore, they must include all drivers necessary for the given bare metal server hardware, such as disks, NICs, and so on.

Note

This section provides example instructions on how to prepare the required images using the diskimage-builder tool. The steps may differ depending on your specific needs and the builder tool. For more information, see Create and add images to the Image service.

To prepare whole disk images:

Use standard cloud images as whole disk images if they contain all necessary drivers. Otherwise, rebuild a cloud image by typing:

diskimage-create <BASE_SYSTEM> -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]

To prepare partition images for netboot:

  1. Use the images from UEC cloud images that have kernel and initramfs as separate images if they contain all the required drivers.

  2. If additional drivers are required, rebuild the standard whole disk cloud image adding the packages as follows:

    diskimage-create <BASE_SYSTEM> baremetal -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]
    
  3. Upload images to Glance in the following formats:

    • For an aki image for kernel, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format aki \
           --container-format aki \
           --file <PATH_TO_IMAGE_KERNEL>
      
    • For an ari image for initramfs, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format ari \
           --container-format ari \
           --file <PATH_TO_IMAGE_INITRAMFS>
      
    • For a rootfs or whole disk image in the output format (qcow2 by default) specified during rebuild, type:

      glance image-create --name <IMAGE_NAME> \
           --disk-format <'QCOW2'_FROM_THE_ABOVE_COMMAND> \
           --container-format <'BARE'_FROM_THE_ABOVE_COMMAND> \
           --kernel-id <UUID_OF_UPLOADED_AKI_IMAGE> \
           --ramdisk-id <UUID_OF_UPLOADED_ARI_IMAGE> \
           --file <PATH_TO_ROOTFS_OR_DISK_IMAGE>
      

      Note

      For rootfs images, set the kernel_id and ramdisk_id image properties to UUIDs of the uploaded aki and ari images respectively.

To prepare partition images for local boot:

  1. Use the images from UEC cloud images that have kernel and initramfs as separate images if they contain all the required drivers.

  2. If additional drivers are required, rebuild the standard whole disk cloud image adding the packages as follows:

    Caution

    Verify that the base operating system has the grub2 package available for installation, and enable it during the rebuild as illustrated in the command below.

    diskimage-create <BASE_SYSTEM> baremetal grub2 -p <EXTRA_PACKAGE_TO_INSTALL> [-p ..]
    

Add bare metal nodes

This section describes the main steps to enroll a bare metal node to Ironic and make it available for provisioning.

To enroll and configure bare metal nodes:

  1. Enroll new nodes to Ironic using the ironic node-create command:

    ironic node-create \
        --name <node-name> \
        --driver <driver-name> \
        --driver-info deploy_ramdisk=<glance UUID of deploy image ramdisk> \
        --driver-info deploy_kernel=<glance UUID of deploy image kernel> \
        --driver-info ipmi_address=<IPMI address of the node> \
        --driver-info ipmi_username=<username for IPMI> \
        --driver-info ipmi_password=<password for the IPMI user> \
        --property memory_mb=<RAM size of the node in MiB> \
        --property cpus=<Number of CPUs on the node> \
        --property local_gb=<size of node's disk in GiB> \
        --property cpu_arch=<architecture of node's CPU>
    

    Where the local_gb property is the size of the biggest disk of the node. We recommend setting it 1 GB smaller than the actual size to accommodate the partition table and the extra configuration drive partition.

  2. Add ports for the node that correspond to the actual NICs of the node:

    ironic port-create --node <UUID_OF_IRONIC_NODE> --address <MAC_ADDRESS>
    

    Note

    At least one port for the node must be created for the NIC that is attached to the provisioning network and from which the node can boot over PXE.

  3. Alternatively, enroll the nodes by adding them to the Reclass model on the cluster level:

    parameters:
      ironic:
        client:
          enabled: true
          nodes:
            admin_identity:
              - name: <node-name>
                driver: pxe_ipmitool
                properties:
                  local_gb: <size of node's disk in GiB>
                  cpus: <Number of CPUs on the node>
                  memory_mb: <RAM size of the node in MiB>
                  cpu_arch: <architecture of node's CPU>
                driver_info:
                  ipmi_username: <username for IPMI>
                  ipmi_password: <password for the IPMI user>
                  ipmi_address: <IPMI address of the node>
                ports:
                  - address: <MAC address of the node port1>
                  - address: <MAC address of the node port2>
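
After enrollment, you can verify that the nodes are registered and that their driver configuration is complete. A minimal check using the Bare Metal CLI; substitute the node name used during enrollment:

  ironic node-list
  ironic node-validate <node-name>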
    

Create compute flavors

Appropriately created compute flavors allow the Compute service to properly schedule workloads to bare metal nodes.

To create nova flavors:

  1. Create a flavor using the nova flavor-create command:

    nova flavor-create <FLAVOR_NAME> <UUID_OR_'auto'> <RAM> <DISK> <CPUS>
    

    Where RAM, DISK, and CPUS equal the corresponding properties set on the bare metal nodes.

  2. Use the above command to create flavors for each type of bare metal node you need to differentiate.
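
    For example, for bare metal nodes with 24 CPUs, 65536 MiB of RAM, and local_gb set to 930, a matching flavor could be created as follows (the flavor name and values are illustrative):

      nova flavor-create bm.general auto 65536 930 24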

Provision instances

After the Ironic nodes, ports, and flavors have been successfully configured, provision instances on the bare metal nodes using the nova boot command:

nova boot <server name> \
    --image <IMAGE_NAME_OR_ID> \
    --flavor <BAREMETAL_FLAVOR_NAME_OR_ID> \
    --nic net-id=<ID_OF_SHARED_BAREMETAL_NETWORK>

Enable SSL on Ironic internal API

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can enable SSL for all OpenStack components while generating a deployment metadata model using the Model Designer UI before deploying a new OpenStack environment. You can also enable SSL on Ironic internal API on an existing OpenStack environment.

The example instruction below describes the following Ironic configuration:

  • The OpenStack Ironic API service runs on the OpenStack ctl nodes.

  • The OpenStack Ironic deploy API and conductor services run on the bmt nodes.

You may need to modify this example configuration depending on the needs of your deployment.

To enable SSL on Ironic internal API on an existing MCP cluster:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Modify ./openstack/baremetal.yml as follows:

    classes:
    - system.salt.minion.cert.openstack_api
    - system.apache.server.proxy
    - system.apache.server.proxy.openstack.ironic
    parameters:
      _param:
        apache_proxy_openstack_api_address: ${_param:cluster_baremetal_local_address}
        apache_proxy_openstack_api_host: ${_param:cluster_baremetal_local_address}
        ironic_conductor_api_url_protocol: https
        openstack_api_cert_alternative_names: IP:127.0.0.1,IP:${_param:cluster_baremetal_local_address},IP:${_param:cluster_baremetal_vip_address},DNS:${linux:system:name},DNS:${linux:network:fqdn},DNS:${_param:cluster_baremetal_local_address},DNS:${_param:cluster_baremetal_vip_address}
        apache_ssl:
          enabled: true
          authority: "${_param:salt_minion_ca_authority}"
          key_file: ${_param:openstack_api_cert_key_file}
          cert_file: ${_param:openstack_api_cert_cert_file}
          chain_file: ${_param:openstack_api_cert_all_file}
        apache_proxy_openstack_ironic_host: 127.0.0.1
        haproxy_https_check_options:
        - httpchk GET /
        - httpclose
        - tcplog
        haproxy_ironic_deploy_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      haproxy:
        proxy:
          listen:
            ironic_deploy:
              type: None
              mode: tcp
              options: ${_param:haproxy_https_check_options}
      ironic:
        api:
          bind:
            address: 127.0.0.1
    
  3. Modify ./openstack/control.yml as follows:

    classes:
    - system.apache.server.proxy.openstack.ironic
    parameters:
      _param:
        apache_proxy_openstack_ironic_host: 127.0.0.1
        haproxy_ironic_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      haproxy:
        proxy:
          listen:
            ironic:
              type: None
              mode: tcp
              options: ${_param:haproxy_https_check_options}
      ironic:
        api:
          bind:
            address: 127.0.0.1
    
  4. Modify ./openstack/control/init.yml as follows:

    parameters:
      _param:
        ironic_service_protocol: ${_param:cluster_internal_protocol}
    
  5. Modify ./openstack/init.yml as follows:

    parameters:
      _param:
        ironic_service_host: ${_param:openstack_service_host}
        ironic_service_protocol: ${_param:cluster_internal_protocol}
    
  6. Modify ./openstack/proxy.yml as follows:

    parameters:
      _param:
        nginx_proxy_openstack_ironic_protocol: https
    
  7. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  8. Apply the following Salt states:

    salt 'bmt*' state.apply salt
    salt -C 'I@ironic:api' state.apply apache
    salt 'prx*' state.apply nginx
    salt -C 'I@ironic:api' state.apply haproxy
    salt -C 'I@ironic:api' state.apply ironic
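
  9. Optionally, verify that the internal Ironic endpoints are now registered with the HTTPS protocol. A minimal check from a controller node, assuming the admin credentials file /root/keystonercv3:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack endpoint list --service baremetal"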
    

Enable the networking-generic-switch driver

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Note

This feature is available as technical preview. Use such configuration for testing and evaluation purposes only.

The networking-generic-switch ML2 mechanism driver in Neutron implements the features required for multitenancy support on the Ironic bare metal nodes. This driver requires the corresponding configuration of the Neutron server service.

To enable the networking-generic-switch driver:

  1. Log in to the Salt Master node.

  2. Open the cluster level of your deployment model.

  3. In openstack/control.yml, add pillars for networking-generic-switch using the example below:

    parameters:
      ...
      neutron:
        server:
          backend:
            mechanism:
              ngs:
                driver: genericswitch
          n_g_s:
            enabled: true
            coordination: # optional
              enabled: true
              backend_url: "etcd3+http://1.2.3.4:2379"
            devices:
              s1brbm:
                options:
                  device_type:
                    value: netmiko_ovs_linux
                  ip:
                    value: 1.2.3.4
                  username:
                    value: ngs_ovs_manager
                  password:
                    value: password
    
  4. Apply the new configuration for the Neutron server:

    salt -C 'I@neutron:server' saltutil.refresh_pillar
    salt -C 'I@neutron:server' state.apply neutron.server
    

Troubleshoot Ironic

The most common and typical Ironic failures are caused by the following peculiarities of the service design:

  • Ironic is sensitive to possible time difference between the nodes that host the ironic-api and ironic-conductor services.

    One of the symptoms of the time being out of sync is the inability to enroll a bare metal node into Ironic with the error message No conductor service registered which supports driver <DRIVER_NAME>, although the <DRIVER_NAME> driver is known to be enabled and is shown in the output of the ironic driver-list command.

    To fix the issue, verify that the time is properly synchronized between the nodes. See the example check after this list.

  • Ironic requires the IPMI access credentials for the nodes to have the admin privilege level. Any lower privilege level, for example engineer, prevents Ironic from functioning properly.
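
To check time synchronization across the nodes that run the Ironic services, you can compare their clocks and NTP status from the Salt Master node. A minimal sketch; the timedatectl output format may vary between distributions:

  salt -C 'I@ironic:api or I@ironic:conductor' cmd.run 'date -u'
  salt -C 'I@ironic:api or I@ironic:conductor' cmd.run 'timedatectl | grep -i synchronized'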

Designate operations

After you deploy an MCP cluster that includes Designate, you can start creating DNS zones and zone records as well as configure auto-generation of records in DNS zones.

Create a DNS zone and record

This section describes how to create a DNS zone and a record in the created DNS zone on the MCP cluster where Designate is deployed.

To create a DNS zone and record:

  1. Log in to the Salt Master node.

  2. Create a test DNS zone called testdomain.tld. by running the following command against one of the controller nodes where Designate is deployed. For example, ctl01.

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack zone create \
    --email dnsmaster@testdomain.tld testdomain.tld."
    

    Once the change is applied to one controller node, the updated distributed database replicates this change between all controller nodes.

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +----------------+--------------------------------------+
     | Field          | Value                                |
     +----------------+--------------------------------------+
     | action         | CREATE                               |
     | attributes     |                                      |
     | created_at     | 2017-08-01T12:25:33.000000           |
     | description    | None                                 |
     | email          | dnsmaster@testdomain.tld             |
     | id             | ce9836a9-ba78-4960-9c89-6a4989a9e095 |
     | masters        |                                      |
     | name           | testdomain.tld.                      |
     | pool_id        | 794ccc2c-d751-44fe-b57f-8894c9f5c842 |
     | project_id     | 49c11a3aa9534d8b897cf06890871840     |
     | serial         | 1501590333                           |
     | status         | PENDING                              |
     | transferred_at | None                                 |
     | ttl            | 3600                                 |
     | type           | PRIMARY                              |
     | updated_at     | None                                 |
     | version        | 1                                    |
     +----------------+--------------------------------------+
    
  3. Verify that a DNS zone is successfully created and is in the ACTIVE status:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack zone list"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +------------------------------------+---------------+-------+-----------+------+------+
     |id                                  |name           |type   |serial     |status|action|
     +------------------------------------+---------------+-------+-----------+------+------+
     |571243e5-17dd-49bd-af09-de6b0c175d8c|example.tld.   |PRIMARY| 1497877051|ACTIVE|NONE  |
     |7043de84-3a40-4b44-ad4c-94dd1e802370|domain.tld.    |PRIMARY| 1498209223|ACTIVE|NONE  |
     |ce9836a9-ba78-4960-9c89-6a4989a9e095|testdomain.tld.|PRIMARY| 1501590333|ACTIVE|NONE  |
     +------------------------------------+---------------+-------+-----------+------+------+
    
  4. Create a record in the new DNS zone by running the command below. Use any IPv4 address to test that it works. For example, 192.168.0.1.

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset create \
    --records '192.168.0.1' --type A testdomain.tld. tstserver01"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
     +-------------+--------------------------------------+
     | Field       | Value                                |
     +-------------+--------------------------------------+
     | action      | CREATE                               |
     | created_at  | 2017-08-01T12:28:37.000000           |
     | description | None                                 |
     | id          | d099f013-460b-41ee-8cf1-3cf0e3c49bc7 |
     | name        | tstserver01.testdomain.tld.         |
     | project_id  | 49c11a3aa9534d8b897cf06890871840     |
     | records     | 192.168.0.1                          |
     | status      | PENDING                              |
     | ttl         | None                                 |
     | type        | A                                    |
     | updated_at  | None                                 |
     | version     | 1                                    |
     | zone_id     | ce9836a9-ba78-4960-9c89-6a4989a9e095 |
     | zone_name   | testdomain.tld.                      |
     +-------------+--------------------------------------+
    
  5. Verify that the record is successfully created and is in the ACTIVE status by running the openstack recordset list [zone_id] command. The zone_id parameter can be found in the output of the command described in the previous step.

    Example:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset list \
    ce9836a9-ba78-4960-9c89-6a4989a9e095"
    
    ctl01.virtual-mcp-ocata-ovs.local:
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    | id| name                      |type|records                                                   |status|action|
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    |...|testdomain.tld.            |SOA |ns1.example.org. dnsmaster.testdomain.tld. 1501590517 3598|ACTIVE|NONE  |
    |...|testdomain.tld.            |NS  |ns1.example.org.                                          |ACTIVE|NONE  |
    |...|tstserver01.testdomain.tld.|A   |192.168.0.1                                               |ACTIVE|NONE  |
    +---+---------------------------+----+----------------------------------------------------------+------+------+
    
  6. Verify that the DNS record can be resolved by running the nslookup tstserver01.testdomain.tld [dns server address] command. In the example below, the DNS server address of the Designate back end is 10.0.0.1.

    Example:

    nslookup tstserver01.testdomain.tld 10.0.0.1
    
    Server:     10.0.0.1
    Address:    10.0.0.1#53
    Name:   tstserver01.testdomain.tld
    Address: 192.168.0.1
    

Configure auto-generation of records in a DNS zone

After you create a DNS zone and a record for this zone as described in Create a DNS zone and record, you can configure auto-generation of records in the created DNS zone.

To configure auto-generation of records in the created DNS zone:

  1. In your Git project repository, change the directory to classes/cluster/<cluster_name>/openstack/.

  2. In init.yml, set the designate_domain_id parameter according to the created DNS zone. For example:

    designate_domain_id: ce9836a9-ba78-4960-9c89-6a4989a9e095
    
  3. Refresh pillars on the Salt Minion nodes:

    salt '*' saltutil.refresh_pillar
    
  4. Apply the Designate states:

    salt -C 'I@designate:server and *01*' state.sls designate.server
    salt -C 'I@designate:server' state.sls designate
    
  5. Using the Nova CLI, boot the VM for which you have created the DNS zone.

  6. Verify that the DNS record related to the VM was created by running the salt 'ctl01*' cmd.run "openstack recordset list [zone_id]" command. For example:

    salt 'ctl01*' cmd.run ". /root/keystonercv3; openstack recordset list \
    ce9836a9-ba78-4960-9c89-6a4989a9e095"
    

    Example of system response:

    ctl01.virtual-mcp-ocata-ovs.local:
    +------------------------------------+---------------------------+----+-----------+------+------+
    |id                                  |name                       |type|records    |status|action|
    +------------------------------------+---------------------------+----+-----------+------+------+
    |d099f013-460b-41ee-8cf1-3cf0e3c49bc7|tstserver01.testdomain.tld.|A   |192.168.0.1|ACTIVE|NONE  |
    +------------------------------------+---------------------------+----+-----------+------+------+
    

Ceph operations

Ceph is a storage back end for cloud environments. After you successfully deploy a Ceph cluster, you can manage its nodes and object storage daemons (Ceph OSDs). This section describes how to add Ceph Monitor, Ceph OSD, and RADOS Gateway nodes to an existing Ceph cluster or remove them, as well as how to remove or replace Ceph OSDs, and migrate the Ceph back end from Filestore to Bluestore and vice versa.

Prerequisites

Before you proceed to manage Ceph nodes and OSDs, or upgrade Ceph, perform the steps below.

  1. Verify that your Ceph cluster is up and running. See the example check after this list.

  2. Log in to the Salt Master node.

  3. Add Ceph pipelines to DriveTrain.

    1. Add the following class to the cluster/cicd/control/leader.yml file:

      classes:
      - system.jenkins.client.job.ceph
      
    2. Apply the salt -C 'I@jenkins:client' state.sls jenkins.client state.
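
To confirm that the cluster is up and running, you can query its status from a Ceph Monitor node. A minimal check from the Salt Master node, assuming the admin keyring is present on the cmn nodes:

  salt 'cmn01*' cmd.run 'ceph -s'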

Manage Ceph nodes

This section describes how to add Ceph Monitor, Ceph OSD, and RADOS Gateway nodes to an existing Ceph cluster or remove them.

Add a Ceph Monitor node

This section describes how to add a Ceph Monitor node to an existing Ceph cluster.

Warning

Prior to the 2019.2.10 maintenance update, this feature is available as technical preview only.

Note

The Ceph Monitor service is quorum-based. Therefore, keep an odd number of Ceph Monitor nodes to establish a quorum.

To add a Ceph Monitor node:

  1. In your project repository, add the following lines to the cluster/ceph/init.yml file and modify them according to your environment:

    _param:
       ceph_mon_node04_hostname: cmn04
       ceph_mon_node04_address: 10.13.0.4
       ceph_mon_node04_ceph_public_address: 172.16.47.145
    linux:
      network:
        host:
          cmn04:
            address: ${_param:ceph_mon_node04_address}
            names:
            - ${_param:ceph_mon_node04_hostname}
            - ${_param:ceph_mon_node04_hostname}.${_param:cluster_domain}
    
  2. Define the backup configuration for the new node in cluster/ceph/init.yml. For example:

    parameters:
      _param:
        ceph_mon_node04_ceph_backup_hour: 4
        ceph_mon_node04_ceph_backup_minute: 0
    
  3. Add the following lines to the cluster/ceph/common.yml file and modify them according to your environment:

    parameters:
      ceph:
        common:
          members:
            - name: ${_param:ceph_mon_node04_hostname}
              host: ${_param:ceph_mon_node04_address}
    
  4. Add the following lines to the cluster/infra/config/nodes.yml file:

    parameters:
      reclass:
        storage:
          node:
            ceph_mon_node04:
              name: ${_param:ceph_mon_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.mon
              params:
                ceph_public_address: ${_param:ceph_mon_node04_ceph_public_address}
                ceph_backup_time_hour: ${_param:ceph_mon_node04_ceph_backup_hour}
                ceph_backup_time_minute: ${_param:ceph_mon_node04_ceph_backup_minute}
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename: ${_param:ceph_mon_system_codename}
                single_address: ${_param:ceph_mon_node04_address}
                keepalived_vip_priority: 104
    
  5. Add the following lines to the cluster/infra/kvm.yml file and modify infra_kvm_node03_hostname depending on which KVM node the Ceph Monitor node should run on:

    parameters:
      salt:
        control:
          size:
            ceph.mon:
              cpu: 8
              ram: 16384
              disk_profile: small
              net_profile: default
          cluster:
            internal:
              node:
                cmn04:
                  name: ${_param:ceph_mon_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.mon
    
  6. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    
  7. Log in to the Jenkins web UI.

  8. Open the Ceph - add node pipeline.

  9. Specify the following parameters:

    Parameter

    Description and values

    SALT_MASTER_CREDENTIALS

    The Salt Master credentials to use for connection, defaults to salt.

    SALT_MASTER_URL

    The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST

    Add the Salt target name of the new Ceph Monitor node. For example, cmn04*.

    HOST_TYPE

    Add mon as the type of Ceph node that is going to be added.

  10. Click Deploy.

The Ceph - add node pipeline workflow:

  1. Launch the Ceph Monitor VMs.

  2. Run the reclass state.

  3. Run the linux, openssh, salt, ntp, rsyslog, ceph.mon states.

  4. Update ceph.conf files on all Ceph nodes.

  5. Run the ceph.mgr state if the pillar is present.
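
After the pipeline finishes, you can verify that the new Ceph Monitor has joined the quorum. A minimal check from the Salt Master node:

  salt 'cmn01*' cmd.run 'ceph mon stat'
  salt 'cmn01*' cmd.run 'ceph -s'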

Add a Ceph OSD node

This section describes how to add a Ceph OSD node to an existing Ceph cluster.

Warning

Prior to the 2019.2.10 maintenance update, this feature is available as technical preview only.

To add a Ceph OSD node:

  1. Connect the Salt Minion node of the new Ceph OSD node to the Salt Master node.

  2. In your project repository, if the nodes are not generated dynamically, add the following lines to cluster/ceph/init.yml and modify according to your environment:

    _param:
       ceph_osd_node05_hostname: osd005
       ceph_osd_node05_address: 172.16.47.72
       ceph_osd_system_codename: xenial
    linux:
      network:
        host:
          osd005:
            address: ${_param:ceph_osd_node05_address}
            names:
            - ${_param:ceph_osd_node05_hostname}
            - ${_param:ceph_osd_node05_hostname}.${_param:cluster_domain}
    
  3. If the nodes are not generated dynamically, add the following lines to the cluster/infra/config/init.yml and modify according to your environment. Otherwise, increase the number of generated OSDs.

    parameters:
      reclass:
        storage:
          node:
            ceph_osd_node05:
              name: ${_param:ceph_osd_node05_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.osd
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_osd_system_codename}
                single_address: ${_param:ceph_osd_node05_address}
                ceph_crush_parent: rack02
    
  4. Since 2019.2.3, skip this step. Otherwise, verify that the cluster/ceph/osd.yml file and the pillar of the new Ceph OSD do not contain the following lines:

    parameters:
      ceph:
        osd:
          crush_update: false
    
  5. Log in to the Jenkins web UI.

  6. Select from the following options:

    • For MCP versions starting from the 2019.2.10 maintenance update, open the Ceph - add osd (upmap) pipeline.

    • For MCP versions prior to the 2019.2.10 maintenance update, open the Ceph - add node pipeline.

    Note

    Prior to the 2019.2.10 maintenance update, the Ceph - add node and Ceph - add osd (upmap) Jenkins pipeline jobs are available as technical preview only.

    Caution

    A large change in the crush weights distribution after the addition of Ceph OSDs can cause massive unexpected rebalancing, affect performance, and in some cases can cause data corruption. Therefore, if you are using Ceph - add node, Mirantis recommends that you add all disks with zero weight and reweight them gradually.

  7. Specify the following parameters:

    Parameter

    Description and values

    SALT_MASTER_CREDENTIALS

    The Salt Master credentials to use for connection, defaults to salt.

    SALT_MASTER_URL

    The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST

    Add the Salt target name of the new Ceph OSD. For example, osd005*.

    HOST_TYPE Removed since 2019.2.3 update

    Add osd as the type of Ceph node that is going to be added.

    CLUSTER_FLAGS Added since 2019.2.7 update

    Add a comma-separated list of flags to check after the pipeline execution.

  8. Click Deploy.

    The Ceph - add node pipeline workflow prior to the 2019.2.3 maintenance update:

    1. Apply the reclass state.

    2. Apply the linux, openssh, salt, ntp, rsyslog, ceph.osd states.

    The Ceph - add node pipeline workflow starting from 2019.2.3 maintenance update:

    1. Apply the reclass state.

    2. Verify that all installed Ceph clients have the Luminous version.

    3. Apply the linux, openssh, salt, ntp, and rsyslog states.

    4. Set the Ceph cluster compatibility to Luminous.

    5. Switch the balancer module to the upmap mode.

    6. Set the norebalance flag before adding a Ceph OSD.

    7. Apply the ceph.osd state on the selected Ceph OSD node.

    8. Update the mappings for the remapped placement group (PG) using upmap back to the old Ceph OSDs.

    9. Unset the norebalance flag and verify that the cluster is healthy.

  9. If you use a custom CRUSH map, update the CRUSH map:

    1. Verify the updated /etc/ceph/crushmap file on cmn01. If correct, apply the CRUSH map using the following commands:

      crushtool -c /etc/ceph/crushmap -o /etc/ceph/crushmap.compiled
      ceph osd setcrushmap -i /etc/ceph/crushmap.compiled
      
    2. Add the following lines to the cluster/ceph/osd.yml file:

      parameters:
        ceph:
          osd:
            crush_update: false
      
    3. Apply the ceph.osd state to persist the CRUSH map:

      salt -C 'I@ceph:osd' state.sls ceph.osd
      
  10. Integrate the Ceph OSD nodes with StackLight:

    1. Update the Salt mine:

      salt -C 'I@ceph:osd or I@telegraf:remote_agent' state.sls salt.minion.grains
      salt -C 'I@ceph:osd or I@telegraf:remote_agent' saltutil.refresh_modules
      salt -C 'I@ceph:osd or I@telegraf:remote_agent' mine.update
      

      Wait for one minute.

    2. Apply the following states:

      salt -C 'I@ceph:osd' state.sls telegraf
      salt -C 'I@ceph:osd' state.sls fluentd
      salt 'mon*' state.sls prometheus
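
After the states are applied, you can verify that the new Ceph OSDs have joined the cluster and that rebalancing has completed. A minimal check from the Salt Master node:

  salt 'cmn01*' cmd.run 'ceph osd tree'
  salt 'cmn01*' cmd.run 'ceph -s'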
      
Add a RADOS Gateway node

This section describes how to add a RADOS Gateway (rgw) node to an existing Ceph cluster.

To add a RADOS Gateway node:

  1. In your project repository, add the following lines to the cluster/ceph/init.yml and modify them according to your environment:

    _param:
      ceph_rgw_node04_hostname: rgw04
      ceph_rgw_node04_address: 172.16.47.162
    linux:
      network:
        host:
          rgw04:
            address: ${_param:ceph_rgw_node04_address}
            names:
            - ${_param:ceph_rgw_node04_hostname}
            - ${_param:ceph_rgw_node04_hostname}.${_param:cluster_domain}
    
  2. Add the following lines to the cluster/ceph/rgw.yml file:

    parameters:
      _param:
        cluster_node04_hostname: ${_param:ceph_rgw_node04_hostname}
        cluster_node04_address: ${_param:ceph_rgw_node04_address}
      ceph:
        common:
          keyring:
            rgw.rgw04:
              caps:
                mon: "allow rw"
                osd: "allow rwx"
      haproxy:
        proxy:
          listen:
            radosgw:
              servers:
                - name: ${_param:cluster_node04_hostname}
                  host: ${_param:cluster_node04_address}
                  port: ${_param:haproxy_radosgw_source_port}
                  params: check
    

    Note

    Starting from the MCP 2019.2.10 maintenance update, the capabilities for RADOS Gateway have been restricted. To update the existing capabilities, perform the steps described in Restrict the RADOS Gateway capabilities.

  3. Add the following lines to the cluster/infra/config/init.yml file:

    parameters:
      reclass:
        storage:
          node:
            ceph_rgw_node04:
              name: ${_param:ceph_rgw_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_rgw_system_codename}
                single_address: ${_param:ceph_rgw_node04_address}
                keepalived_vip_priority: 104
    
  4. Add the following lines to the cluster/infra/kvm.yml file and modify infra_kvm_node03_hostname depending on which KVM node the rgw must be running on:

    parameters:
      salt:
        control:
          size:
            ceph.rgw:
              cpu: 8
              ram: 16384
              disk_profile: small
              net_profile: default
          cluster:
            internal:
              node:
                rgw04:
                  name: ${_param:ceph_rgw_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.rgw
    
  5. Log in to the Jenkins web UI.

  6. Open the Ceph - add node pipeline.

  7. Specify the following parameters:

    Parameter

    Description and values

    SALT_MASTER_CREDENTIALS

    The Salt Master credentials to use for connection, defaults to salt.

    SALT_MASTER_URL

    The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST

    Add the Salt target name of the new RADOS Gateway node. For example, rgw04*.

    HOST_TYPE

    Add rgw as the type of Ceph node that is going to be added.

  8. Click Deploy.

The Ceph - add node pipeline workflow:

  1. Launch RADOS Gateway VMs.

  2. Run the reclass state.

  3. Run the linux, openssh, salt, ntp, rsyslog, keepalived, haproxy, ceph.radosgw states.
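
After the pipeline finishes, you can verify that the new RADOS Gateway daemon has registered with the cluster. A minimal check from the Salt Master node; on Luminous, the rgw daemons are listed in the services section of the status output:

  salt 'cmn01*' cmd.run 'ceph -s'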

Remove a Ceph Monitor node

This section describes how to remove a Ceph Monitor node from a Ceph cluster.

Note

The Ceph Monitor service is quorum-based. Therefore, keep an odd number of Ceph Monitor nodes to establish a quorum.

To remove a Ceph Monitor node:

  1. In your project repository, remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

    parameters:
      reclass:
        storage:
          node:
            ceph_mon_node04:
              name: ${_param:ceph_mon_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.mon
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_mon_system_codename}
                single_address: ${_param:ceph_mon_node04_address}
                keepalived_vip_priority: 104
    
  2. Remove the following lines from the cluster/ceph/common.yml file or from the pillar based on your environment:

    parameters:
      ceph:
        common:
          members:
            - name: ${_param:ceph_mon_node04_hostname}
              host: ${_param:ceph_mon_node04_address}
    
  3. Log in to the Jenkins web UI.

  4. Open the Ceph - remove node pipeline.

  5. Specify the following parameters:

    Parameter

    Description and values

    SALT_MASTER_CREDENTIALS

    The Salt Master credentials to use for connection, defaults to salt.

    SALT_MASTER_URL

    The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST

    Add the Salt target name of the Ceph Monitor node to remove. For example, cmn04*.

    HOST_TYPE

    Add mon as the type of Ceph node that is going to be removed.

  6. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Reconfigure the configuration file on all ceph:common minions.

    2. Destroy the VM.

    3. Remove the Salt Minion node ID from salt-key on the Salt Master node.

  7. Remove the following lines from the cluster/infra/kvm.yml file or from the pillar based on your environment:

    parameters:
      salt:
        control:
          cluster:
            internal:
              node:
                cmn04:
                  name: ${_param:ceph_mon_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.mon
    
  8. Remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

    _param:
       ceph_mon_node04_hostname: cmn04
       ceph_mon_node04_address: 172.16.47.145
    linux:
      network:
        host:
          cmn04:
            address: ${_param:ceph_mon_node04_address}
            names:
            - ${_param:ceph_mon_node04_hostname}
            - ${_param:ceph_mon_node04_hostname}.${_param:cluster_domain}
    
Remove a Ceph OSD node

This section describes how to remove a Ceph OSD node from a Ceph cluster.

To remove a Ceph OSD node:

  1. If the host is explicitly defined in the model, perform the following steps. Otherwise, proceed to step 2.

    1. In your project repository, remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

      _param:
         ceph_osd_node05_hostname: osd005
         ceph_osd_node05_address: 172.16.47.72
         ceph_osd_system_codename: xenial
      linux:
        network:
          host:
            osd005:
              address: ${_param:ceph_osd_node05_address}
              names:
              - ${_param:ceph_osd_node05_hostname}
              - ${_param:ceph_osd_node05_hostname}.${_param:cluster_domain}
      
    2. Remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

      parameters:
        reclass:
          storage:
            node:
              ceph_osd_node05:
                name: ${_param:ceph_osd_node05_hostname}
                domain: ${_param:cluster_domain}
                classes:
                - cluster.${_param:cluster_name}.ceph.osd
                params:
                  salt_master_host: ${_param:reclass_config_master}
                  linux_system_codename:  ${_param:ceph_osd_system_codename}
                  single_address: ${_param:ceph_osd_node05_address}
                  ceph_crush_parent: rack02
      
  2. Log in to the Jenkins web UI.

  3. Open the Ceph - remove node pipeline.

  4. Specify the following parameters:

    • SALT_MASTER_CREDENTIALS - The Salt Master credentials to use for connection, defaults to salt.

    • SALT_MASTER_URL - The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    • HOST - Add the Salt target name of the Ceph OSD node to remove. For example, osd005*.

    • HOST_TYPE - Add osd as the type of Ceph node that is going to be removed.

    • GENERATE_CRUSHMAP - Select if the CRUSH map file should be updated. Applying the updated CRUSH map has to happen manually unless it is specifically set to be enforced in the pillar.

    • ADMIN_HOST - Add cmn01* as the Ceph cluster node with the admin keyring.

    • WAIT_FOR_HEALTHY - Verify that this parameter is selected as it enables the Ceph health check within the pipeline.

  5. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Mark all Ceph OSDs running on the specified HOST as out. If you selected the WAIT_FOR_HEALTHY parameter, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.

    2. Stop all Ceph OSDs services running on the specified HOST.

    3. Remove all Ceph OSDs running on the specified HOST from the CRUSH map.

    4. Remove all Ceph OSD authentication keys running on the specified HOST.

    5. Remove all Ceph OSDs running on the specified HOST from the Ceph cluster.

    6. Purge the Ceph packages from the specified HOST.

    7. Stop the Salt Minion node on the specified HOST.

    8. Remove all Ceph OSDs running on the specified HOST from the Ceph cluster.

    9. Remove the Salt Minion node ID from salt-key on the Salt Master node.

    10. Update the CRUSHMAP file on the I@ceph:setup:crush node if GENERATE_CRUSHMAP was selected. You must manually apply the update unless it is specified otherwise in the pillar.

  6. If you selected GENERATE_CRUSHMAP, check the updated /etc/ceph/crushmap file on cmn01. If it is correct, apply the CRUSH map:

    crushtool -c /etc/ceph/crushmap -o /etc/ceph/crushmap.compiled
    ceph osd setcrushmap -i /etc/ceph/crushmap.compiled
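
Once done, you can optionally confirm that the removed Ceph OSDs are no longer part of the cluster. The following is a sketch; run it on a node with the admin keyring, for example cmn01:

  # run on a node with the admin keyring (for example, cmn01)
  ceph osd tree
  ceph -s

The OSDs of the removed host must not appear in the CRUSH tree, and the cluster should return to HEALTH_OK once rebalancing completes.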
    
Remove a RADOS Gateway node

This section describes how to remove a RADOS Gateway (rgw) node from a Ceph cluster.

To remove a RADOS Gateway node:

  1. In your project repository, remove the following lines from the cluster/ceph/rgw.yml file or from the pillar based on your environment:

    parameters:
      _param:
        cluster_node04_hostname: ${_param:ceph_rgw_node04_hostname}
        cluster_node04_address: ${_param:ceph_rgw_node04_address}
      ceph:
        common:
          keyring:
            rgw.rgw04:
              caps:
                mon: "allow rw"
                osd: "allow rwx"
      haproxy:
        proxy:
          listen:
            radosgw:
              servers:
                - name: ${_param:cluster_node04_hostname}
                  host: ${_param:cluster_node04_address}
                  port: ${_param:haproxy_radosgw_source_port}
                  params: check
    
  2. Remove the following lines from the cluster/infra/config/init.yml file or from the pillar based on your environment:

    parameters:
      reclass:
        storage:
          node:
            ceph_rgw_node04:
              name: ${_param:ceph_rgw_node04_hostname}
              domain: ${_param:cluster_domain}
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
              params:
                salt_master_host: ${_param:reclass_config_master}
                linux_system_codename:  ${_param:ceph_rgw_system_codename}
                single_address: ${_param:ceph_rgw_node04_address}
                keepalived_vip_priority: 104
    
  3. Log in to the Jenkins web UI.

  4. Open the Ceph - remove node pipeline.

  5. Specify the following parameters:

    • SALT_MASTER_CREDENTIALS - The Salt Master credentials to use for connection, defaults to salt.

    • SALT_MASTER_URL - The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    • HOST - Add the Salt target name of the RADOS Gateway node to remove. For example, rgw04*.

    • HOST_TYPE - Add rgw as the type of Ceph node that is going to be removed.

  6. Click Deploy.

    The Ceph - remove node pipeline workflow:

    1. Reconfigure HAProxy on the remaining RADOS Gateway nodes.

    2. Destroy the VM.

    3. Remove the Salt Minion node ID from salt-key on the Salt Master node.

  7. Remove the following lines from the cluster/infra/kvm.yml file or from the pillar based on your environment:

    parameters:
      salt:
        control:
          cluster:
            internal:
              node:
                rgw04:
                  name: ${_param:ceph_rgw_node04_hostname}
                  provider: ${_param:infra_kvm_node03_hostname}.${_param:cluster_domain}
                  image: ${_param:salt_control_xenial_image}
                  size: ceph.rgw
    
  8. Remove the following lines from the cluster/ceph/init.yml file or from the pillar based on your environment:

    _param:
      ceph_rgw_node04_hostname: rgw04
      ceph_rgw_node04_address: 172.16.47.162
    linux:
      network:
        host:
          rgw04:
            address: ${_param:ceph_rgw_node04_address}
            names:
            - ${_param:ceph_rgw_node04_hostname}
            - ${_param:ceph_rgw_node04_hostname}.${_param:cluster_domain}
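
Once done, you can optionally verify that the node was removed from the Salt and HAProxy configuration. The following is a sketch, assuming the removed node was rgw04 and the standard HAProxy configuration path; run it from the Salt Master node:

  # rgw04 is the removed node; /etc/haproxy/haproxy.cfg is the standard HAProxy path
  salt-key -L | grep rgw04
  salt -C 'I@ceph:radosgw' cmd.run 'grep rgw04 /etc/haproxy/haproxy.cfg'

Both commands should return no matches, which confirms that the Salt Minion key was deleted and HAProxy no longer references the removed back end.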
    
Replace a failed Ceph OSD

This section instructs you on how to replace a failed physical node with a Ceph OSD or multiple OSD nodes running on it using the Ceph - replace failed OSD Jenkins pipeline.

To replace a failed physical node with a Ceph OSD or multiple OSD nodes:

  1. Log in to the Jenkins web UI.

  2. Open the Ceph - replace failed OSD pipeline.

  3. Specify the following parameters:

    • SALT_MASTER_CREDENTIALS - The Salt Master credentials to use for connection, defaults to salt.

    • SALT_MASTER_URL - The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    • HOST - Add the Salt target name of the Ceph OSD node. For example, osd005*.

    • OSD - Add a comma-separated list of Ceph OSDs on the specified HOST node. For example, 1,2.

    • DEVICE [0] - Add a comma-separated list of failed devices to replace at HOST. For example, /dev/sdb,/dev/sdc.

    • DATA_PARTITION [0] - (Optional) Add a comma-separated list of mounted partitions of the failed device. These partitions will be unmounted. We recommend using this option when multiple OSDs are deployed per device. For example, /dev/sdb1,/dev/sdb3.

    • JOURNAL_BLOCKDB_BLOCKWAL_PARTITION [0] - Add a comma-separated list of partitions that store the journal, block_db, or block_wal of the failed devices on the specified HOST. For example, /dev/sdh2,/dev/sdh3.

    • ADMIN_HOST - Add cmn01* as the Ceph cluster node with the admin keyring.

    • CLUSTER_FLAGS - Add a comma-separated list of flags to apply before and after the pipeline.

    • WAIT_FOR_HEALTHY - Select to perform the Ceph health check within the pipeline.

    • DMCRYPT [0] - Select if you are replacing an encrypted OSD. In such a case, also specify noout,norebalance as CLUSTER_FLAGS.

  4. Click Deploy.

The Ceph - replace failed OSD pipeline workflow:

  1. Mark the Ceph OSD as out.

  2. Wait until the Ceph cluster is in a healthy state if WAIT_FOR_HEALTHY was selected. In this case, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.

  3. Stop the Ceph OSD service.

  4. Remove the Ceph OSD from the CRUSH map.

  5. Remove the Ceph OSD authentication key.

  6. Remove the Ceph OSD from the Ceph cluster.

  7. Unmount data partition(s) of the failed disk.

  8. Delete the partition table of the failed disk.

  9. Remove the block_db, block_wal, or journal partition of the failed disk.

  10. Perform one of the following depending on the MCP release version:

    • For deployments prior to the MCP 2019.2.3 update, redeploy the failed Ceph OSD.

    • For deployments starting from the MCP 2019.2.3 update:

      1. Wait for the hardware replacement and confirmation to proceed.

      2. Redeploy the failed Ceph OSD on the replaced hardware.

Note

If any of the steps 1 - 9 has already been performed manually, Jenkins proceeds to the next step.

[0] The parameter has been removed starting from the MCP 2019.2.3 update.
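
Before you confirm the hardware replacement, you can additionally check whether the affected Ceph OSDs can be removed without risking data availability. The following is a sketch that assumes Ceph Luminous or newer and the OSD IDs 1 and 2 from the OSD parameter; run it on a node with the admin keyring, for example cmn01:

  # assumes Ceph Luminous or newer; osd.1 and osd.2 are the OSD IDs from the OSD parameter
  ceph osd safe-to-destroy osd.1 osd.2
  ceph -s

Proceed only when the command reports that the OSDs are safe to destroy and the cluster health is acceptable.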

Restrict the RADOS Gateway capabilities

Note

This feature is available starting from the MCP 2019.2.10 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

To avoid a potential security vulnerability, Mirantis recommends that you restrict the RADOS Gateway capabilities of your existing MCP deployment to a bare minimum.

To restrict the RADOS Gateway capabilities of an existing MCP deployment:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. In cluster/ceph/rgw.yml, modify the RADOS Gateway capabilities as follows:

    ceph:
      common:
        keyring:
          rgw.rgw01:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
          rgw.rgw02:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
          rgw.rgw03:
            caps:
              mon: "allow rw"
              osd: "allow rwx"
    
  3. Log in to the Salt Master node.

  4. Apply the changes:

    salt -I ceph:radosgw state.apply ceph.common,ceph.setup.keyring
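
You can optionally verify the resulting capabilities of the RADOS Gateway keyrings. The following is a sketch that assumes the keyring entity is named client.rgw.rgw01; run it on a node with the admin keyring, for example cmn01:

  # client.rgw.rgw01 is the assumed keyring entity name
  ceph auth get client.rgw.rgw01

The output should contain only the restricted capabilities, that is mon 'allow rw' and osd 'allow rwx'.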
    

Enable the Ceph Prometheus plugin

If you have deployed StackLight LMA, you can enhance Ceph monitoring by enabling the Ceph Prometheus plugin that is based on the native Prometheus exporter introduced in Ceph Luminous. In this case, the Ceph Prometheus plugin, instead of Telegraf, collects Ceph metrics providing a wider set of graphs in the Grafana web UI, such as an overview of the Ceph cluster, hosts, OSDs, pools, RADOS gateway nodes, as well as detailed graphs on the Ceph OSD and RADOS Gateway nodes. You can enable the Ceph Prometheus plugin manually on an existing MCP cluster as described below or during the upgrade of StackLight LMA as described in Upgrade StackLight LMA using the Jenkins job.

To enable the Ceph Prometheus plugin manually:

  1. Update the Ceph formula package.

  2. Open your project Git repository with Reclass model on the cluster level.

  3. In classes/cluster/cluster_name/ceph/mon.yml, remove the service.ceph.monitoring.cluster_stats class.

  4. In classes/cluster/cluster_name/ceph/osd.yml, remove the service.ceph.monitoring.node_stats class.

  5. Log in to the Salt Master node.

  6. Refresh grains to set the new alerts and graphs:

    salt '*' state.sls salt.minion.grains
    
  7. Enable the Prometheus plugin:

    salt -C I@ceph:mon state.sls ceph.mgr
    
  8. Update the targets and alerts in Prometheus:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus
    
  9. Update the new Grafana dashboards:

    salt -C 'I@grafana:client' state.sls grafana
    
  10. (Optional) Enable the StackLight LMA prediction alerts for Ceph.

    Note

    This feature is available as technical preview. Use such configuration for testing and evaluation purposes only.

    Warning

    This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

    1. Open your project Git repository with Reclass model on the cluster level.

    2. In classes/cluster/cluster_name/ceph/common.yml, set enable_prediction to True:

      parameters:
        ceph:
          common:
            enable_prediction: True
      
    3. Log in to the Salt Master node.

    4. Refresh grains to set the new alerts and graphs:

      salt '*' state.sls salt.minion.grains
      
    5. Verify and update the alerts thresholds based on the cluster hardware.

      Note

      For details about tuning the thresholds, contact Mirantis support.

    6. Update the targets and alerts in Prometheus:

      salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus
      
  11. Customize Ceph prediction alerts as described in Ceph.
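
You can optionally verify that the Ceph Manager nodes expose the native Prometheus endpoint. The following is a sketch; port 9283 is the default port of the Ceph Prometheus module, and the address is an assumption for your environment:

  # 9283 is the default Ceph Prometheus module port; <cmn01_address> is a placeholder
  salt -C 'I@ceph:mon' cmd.run 'ceph mgr services'
  curl -s http://<cmn01_address>:9283/metrics | head

The first command should list the prometheus service URL, and the metrics output should contain the ceph_* time series that Prometheus scrapes.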

Enable Ceph compression

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

RADOS Gateway supports server-side compression of uploaded objects using the Ceph compression plugins. You can manually enable Ceph compression to rationalize the capacity usage on the MCP cluster.

To enable Ceph compression:

  1. Log in to any rgw node.

  2. Run the radosgw-admin zone placement modify command with the --compression=<type> option specifying the compression plugin type and other options as required. The available compression plugins to use when writing new object data are zlib, snappy, or zstd. For example:

    radosgw-admin zone placement modify \
      --rgw-zone default \
      --placement-id default-placement \
      --storage-class STANDARD \
      --compression zlib
    

    Note

    If you have not previously performed any Multi-site configuration, you can use the default values for the options except compression. To disable compression, set the compression type to an empty string or none.
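
To verify that compression is enabled, you can inspect the zone placement configuration. The following is a sketch that assumes the default zone and placement target; run it on the rgw node:

  # assumes the default zone and placement target
  radosgw-admin zone get --rgw-zone default

The placement_pools section of the output should show the configured compression type, for example "compression": "zlib".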

See also

Ceph compression

Enable the ceph-volume tool

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Warning

  • Prior to the 2019.2.10 maintenance update, ceph-volume is available as technical preview only.

  • Starting from the 2019.2.10 maintenance update, the ceph-volume tool is fully supported and must be enabled prior to upgrading from Ceph Luminous to Nautilus. The ceph-disk tool is deprecated.

This section describes how to enable the ceph-volume command-line tool that enables you to deploy and inspect Ceph OSDs using the Logical Volume Management (LVM) functionality for provisioning block devices. The main difference between ceph-disk and ceph-volume is that ceph-volume does not automatically partition disks used for block.db. However, partitioning is performed within the procedure below.

To enable the ceph-volume tool:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. If you are upgrading from Ceph Luminous to Nautilus, specify the legacy_disks pillar in classes/cluster/<cluster_name>/ceph/osd.yml to allow the operation of both ceph-disk and ceph-volume-deployed OSDs:

    parameters:
      ceph:
        osd:
          legacy_disks:
            0:
               class: hdd
               weight: 0.048691
               dev: /dev/vdc
    
  3. In classes/cluster/<cluster_name>/ceph/osd.yml, define the partitions and logical volumes to use as OSD devices.

    parameters:
      linux:
        storage:
          disk:
            osd_blockdev:
              startsector: 1
              name: /dev/vdd
              type: gpt
              partitions:
                - size: 10240
                - size: 10240
          lvm:
            ceph:
              enabled: true
              devices:
              - /dev/vdc
              volume:
                osd01:
                  size: 15G
                osd02:
                  size: 15G
    
  4. In classes/cluster/<cluster_name>/ceph/osd.yml, set the lvm_enabled parameter to True:

    parameters:
      ceph:
        osd:
          lvm_enabled: True
    
  5. Apply the changes:

    salt -C 'I@ceph:osd' saltutil.refresh_pillar
    
  6. Remove the OSD nodes as described in Remove a Ceph OSD node.

  7. Add new OSD nodes as described in Add a Ceph OSD node.
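
After the Ceph OSD nodes are redeployed, you can optionally confirm that the new OSDs were created through LVM. The following is a sketch; run it on a Ceph OSD node:

  # lists OSDs deployed through LVM
  ceph-volume lvm list

The output should list the logical volumes and devices that back each OSD deployed with ceph-volume.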

Shut down a Ceph cluster for maintenance

This section describes how to properly shut down an entire Ceph cluster for maintenance and bring it up afterward.

To shut down a Ceph cluster for maintenance:

  1. Log in to the Salt Master node.

  2. Stop the OpenStack workloads.

  3. Stop the services that are using the Ceph cluster. For example:

    • Manila workloads (if you have shares on top of Ceph mount points)

    • heat-engine (if it has the autoscaling option enabled)

    • glance-api (if it uses Ceph to store images)

    • cinder-scheduler (if it uses Ceph to store volumes)
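
    The following is a minimal sketch of stopping some of these services from the Salt Master node; the exact service names and Salt targets are assumptions that depend on your deployment:

    # service names and Salt targets depend on your deployment
    salt -C 'I@glance:server' service.stop glance-api
    salt -C 'I@cinder:controller' service.stop cinder-scheduler
    salt -C 'I@heat:server' service.stop heat-engine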

  4. Identify the first Ceph Monitor for operations:

    CEPH_MON=$(salt -C 'I@ceph:mon' --out=txt test.ping | sort | head -1 | \
        cut -d: -f1)
    
  5. Verify that the Ceph cluster is in healthy state:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health HEALTH_OK
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e102: 6 osds: 6 up, 6 in
                    flags sortbitwise,require_jewel_osds
              pgmap v41138: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    798 MB used, 60575 MB / 61373 MB avail
                         384 active+clean
    
  6. Set the following flags to disable rebalancing and restructuring and to pause the Ceph cluster:

    salt "${CEPH_MON}" cmd.run 'ceph osd set noout'
    salt "${CEPH_MON}" cmd.run 'ceph osd set nobackfill'
    salt "${CEPH_MON}" cmd.run 'ceph osd set norecover'
    salt "${CEPH_MON}" cmd.run 'ceph osd set norebalance'
    salt "${CEPH_MON}" cmd.run 'ceph osd set nodown'
    salt "${CEPH_MON}" cmd.run 'ceph osd set pause'
    
  7. Verify that the flags are set:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health HEALTH_WARN
                    pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e108: 6 osds: 6 up, 6 in
                    flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover,sortbitwise,require_jewel_osds
              pgmap v41152: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    799 MB used, 60574 MB / 61373 MB avail
                         384 active+clean
    
  8. Shut down the Ceph cluster.

    Warning

    Shut down the nodes one by one in the following order:

    1. Service nodes (for example, RADOS Gateway nodes)

    2. Ceph OSD nodes

    3. Ceph Monitor nodes

Once done, perform the maintenance as required.


To start a Ceph cluster after maintenance:

  1. Log in to the Salt Master node.

  2. Start the Ceph cluster nodes.

    Warning

    Start the Ceph nodes one by one in the following order:

    1. Ceph Monitor nodes

    2. Ceph OSD nodes

    3. Service nodes (for example, RADOS Gateway nodes)

  3. Verify that the Salt minions are up:

    salt -C "I@ceph:common" test.ping
    
  4. Verify that the date is the same for all Ceph clients:

    salt -C "I@ceph:common" cmd.run date
    
  5. Identify the first Ceph Monitor for operations:

    CEPH_MON=$(salt -C 'I@ceph:mon' --out=txt test.ping | sort | head -1 | \
        cut -d: -f1)
    
  6. Unset the following flags to resume the Ceph cluster:

    salt "${CEPH_MON}" cmd.run 'ceph osd unset pause'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset nodown'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset norebalance'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset norecover'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset nobackfill'
    salt "${CEPH_MON}" cmd.run 'ceph osd unset noout'
    
  7. Verify that the Ceph cluster is in healthy state:

    salt "${CEPH_MON}" cmd.run 'ceph -s'
    

    Example of system response:

    cmn01.domain.com:
            cluster e0b75d1b-544c-4e5d-98ac-cfbaf29387ca
             health HEALTH_OK
             monmap e3: 3 mons at {cmn01=192.168.16.14:6789/0,cmn02=192.168.16.15:6789/0,cmn03=192.168.16.16:6789/0}
                    election epoch 42, quorum 0,1,2 cmn01,cmn02,cmn03
             osdmap e102: 6 osds: 6 up, 6 in
                    flags sortbitwise,require_jewel_osds
              pgmap v41138: 384 pgs, 6 pools, 45056 kB data, 19 objects
                    798 MB used, 60575 MB / 61373 MB avail
                         384 active+clean
    

Back up and restore Ceph

This section describes how to back up and restore Ceph OSD nodes metadata and Ceph Monitor nodes.

Note

This documentation does not provide instructions on how to back up the data stored in Ceph.

Create a backup schedule for Ceph nodes

This section describes how to manually create a backup schedule for Ceph OSD nodes metadata and for Ceph Monitor nodes.

By default, the backup functionality is enabled automatically for new MCP OpenStack with Ceph deployments whose cluster models were generated using Model Designer. Use this procedure only in case of a manual deployment or if you want to change the default backup configuration.

Note

The procedure below does not cover the backup of the Ceph OSD node data.

To create a backup schedule for Ceph nodes:

  1. Log in to the Salt Master node.

  2. Decide on which node you want to store the backups.

  3. Obtain <STORAGE_ADDRESS> of the node selected in step 2:

    cfg01:~# salt NODE_NAME grains.get fqdn_ip4
    
  4. Configure the ceph backup server role by adding the cluster.deployment_name.infra.backup.server class to the definition of the target storage node from step 2:

    classes:
    - cluster.deployment_name.infra.backup.server
    parameters:
      _param:
        ceph_backup_public_key: <generate_your_keypair>
    

    By default, adding this include statement results in Ceph keeping five complete backups. To change the default setting, add the following pillar to the cluster/infra/backup/server.yml file:

    parameters:
      ceph:
        backup:
          server:
            enabled: true
            hours_before_full: 24
            full_backups_to_keep: 5
    
  5. To back up the Ceph Monitor nodes, configure the ceph backup client role by adding the following lines to the cluster/ceph/mon.yml file:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2.

    classes:
    - system.ceph.backup.client.single
    parameters:
      _param:
        ceph_remote_backup_server: <STORAGE_ADDRESS>
        root_private_key: |
          <generate_your_keypair>
    
  6. To back up the Ceph OSD nodes metadata, configure the ceph backup client role by adding the following lines to the cluster/ceph/osd.yml file:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2.

    classes:
    - system.ceph.backup.client.single
    parameters:
      _param:
        ceph_remote_backup_server: <STORAGE_ADDRESS>
        root_private_key: |
          <generate_your_keypair>
    

    By default, adding the above include statement results in Ceph keeping three complete backups on the client node. To change the default setting, add the following pillar to the cluster/ceph/mon.yml or cluster/ceph/osd.yml files:

    Note

    Change <STORAGE_ADDRESS> to the address of the target storage node from step 2.

    parameters:
      ceph:
        backup:
          client:
            enabled: true
            full_backups_to_keep: 3
            hours_before_full: 24
            target:
              host: <STORAGE_ADDRESS>
    
  7. Refresh Salt pillars:

    salt -C '*' saltutil.refresh_pillar
    
  8. Apply the salt.minion state:

    salt -C 'I@ceph:backup:client or I@ceph:backup:server' state.sls salt.minion
    
  9. Refresh grains for the ceph client node:

    salt -C 'I@ceph:backup:client' saltutil.sync_grains
    
  10. Update the mine for the ceph client node:

    salt -C 'I@ceph:backup:client' mine.flush
    salt -C 'I@ceph:backup:client' mine.update
    
  11. Apply the following state on the ceph client node:

    salt -C 'I@ceph:backup:client' state.sls openssh.client,ceph.backup
    
  12. Apply the linux.system.cron state on the ceph server node:

    salt -C 'I@ceph:backup:server' state.sls linux.system.cron
    
  13. Apply the ceph.backup state on the ceph server node:

    salt -C 'I@ceph:backup:server' state.sls ceph.backup
    
Create an instant backup of a Ceph OSD node metadata or a Ceph Monitor node

After you create a backup schedule as described in Create a backup schedule for Ceph nodes, you may also need to create an instant backup of a Ceph OSD node metadata or a Ceph Monitor node.

Note

The procedure below does not cover the backup of the Ceph OSD node data.

To create an instant backup of a Ceph node:

  1. Verify that you have completed the steps described in Create a backup schedule for Ceph nodes.

  2. Log in to a Ceph node. For example, cmn01.

  3. Run the following script:

    /usr/local/bin/ceph-backup-runner-call.sh
    
  4. Verify that a complete backup was created locally:

    ls /var/backups/ceph/full
    
  5. Verify that the complete backup was rsynced to the ceph backup server node from the Salt Master node:

    salt -C 'I@ceph:backup:server' cmd.run 'ls /srv/volumes/backup/ceph/full'
    
Restore a Ceph Monitor node

You may need to restore a Ceph Monitor node after a failure. For example, if the data in the Ceph-related directories disappeared.

To restore a Ceph Monitor node:

  1. Verify that the Ceph Monitor instance is up and running and connected to the Salt Master node.

  2. Log in to the Ceph Monitor node.

  3. Synchronize Salt modules and refresh Salt pillars:

    salt-call saltutil.sync_all
    salt-call saltutil.refresh_pillar
    
  4. Run the following Salt states:

    salt-call state.sls linux,openssh,salt,ntp,rsyslog
    
  5. Manually install Ceph packages:

    apt install ceph-mon -y
    
  6. Remove the following files from Ceph:

    rm -rf /etc/ceph/* /var/lib/ceph/*
    
  7. From the Ceph backup, copy the files from /etc/ceph/ and /var/lib/ceph to their original directories:

    cp -r /<etc_ceph_backup_path>/* /etc/ceph/
    cp -r /<var_lib_ceph_backup_path>/* /var/lib/ceph/
    
  8. Change the files ownership:

    chown -R ceph:ceph /var/lib/ceph/*
    
  9. Run the following Salt state:

    salt-call state.sls ceph
    

    If the output contains an error, rerun the state.
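
Once done, you can verify that the restored Ceph Monitor rejoined the quorum. The following is a sketch that assumes the monitor instance is named after the short hostname; run it on the restored node:

  # assumes the monitor instance is named after the short hostname
  systemctl status ceph-mon@$(hostname -s)
  ceph -s

The ceph-mon service should be active, and the monitor should be listed in the quorum in the ceph -s output.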

Restore the metadata of a Ceph OSD node

You may need to restore the metadata of a Ceph OSD node after a failure. For example, if the primary disk fails or the data in the Ceph-related directories, such as /var/lib/ceph/, on the OSD node disappeared.

To restore the metadata of a Ceph OSD node:

  1. Verify that the Ceph OSD node is up and running and connected to the Salt Master node.

  2. Log in to the Ceph OSD node.

  3. Synchronize Salt modules and refresh Salt pillars:

    salt-call saltutil.sync_all
    salt-call saltutil.refresh_pillar
    
  4. Run the following Salt states:

    salt-call state.sls linux,openssh,salt,ntp,rsyslog
    
  5. Manually install Ceph packages:

    apt install ceph-osd -y
    
  6. Stop all ceph-osd services:

    systemctl stop ceph-osd@<num>
    
  7. Remove the following files from Ceph:

    rm -rf /etc/ceph/* /var/lib/ceph/*
    
  8. From the Ceph backup, copy the files from /etc/ceph/ and /var/lib/ceph to their original directories:

    cp -r /<path>/* /etc/ceph/
    cp -r /<path>/* /var/lib/ceph/
    
  9. Change the files ownership:

    chown -R ceph:ceph /var/lib/ceph/*
    
  10. Restart the services for all Ceph OSDs:

    systemctl restart ceph-osd@<osd_num>
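
Once done, you can verify that the restored Ceph OSDs rejoined the cluster. The following is a sketch; run it on a node with the admin keyring, for example cmn01:

  # run on a node with the admin keyring
  ceph osd stat
  ceph osd tree

All Ceph OSDs of the restored node should be reported as up and in.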
    

Migrate the Ceph back end

Ceph uses FileStore or BlueStore as a storage back end. You can migrate the Ceph storage back end from FileStore to BlueStore and vice versa using the Ceph - backend migration pipeline.

Note

The BlueStore back end is only supported if your Ceph version is Luminous or newer.

To migrate the Ceph back end:

  1. In your project repository, open the cluster/ceph/osd.yml file for editing:

    1. Change the back end type and block_db or journal for every OSD disk device.

    2. Specify the size of the journal or block_db device if it resides on a different device than the storage device. The capacity of that device is divided equally among the number of OSDs using it. For example, if ten OSDs share one 100 GB block_db device, each OSD receives roughly 10 GB of block_db space.

    Example:

    parameters:
      ceph:
        osd:
          bluestore_block_db_size: 10073741824
    #      journal_size: 10000
          backend:
    #        filestore:
            bluestore:
              disks:
              - dev: /dev/sdh
                block_db: /dev/sdj
    #            journal: /dev/sdj
    

    The commented lines show the FileStore-related entries that must be replaced or removed when migrating from FileStore to BlueStore.

  2. Log in to the Jenkins web UI.

  3. Open the Ceph - backend migration pipeline.

  4. Specify the following parameters:

    • SALT_MASTER_CREDENTIALS - The Salt Master credentials to use for connection, defaults to salt.

    • SALT_MASTER_URL - The Salt Master node host URL with the salt-api port, defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    • ADMIN_HOST - Add cmn01* as the Ceph cluster node with the admin keyring.

    • TARGET - Add the Salt target name of the Ceph OSD node(s). For example, osd005* to migrate on one OSD host or osd* to migrate on all OSD hosts.

    • OSD - Add * to target all OSD disks on all TARGET OSD hosts, or a comma-separated list of Ceph OSDs if targeting just one OSD host by TARGET. For example, 1,2.

    • WAIT_FOR_HEALTHY - Verify that this parameter is selected as it enables the Ceph health check within the pipeline.

    • PER_OSD_CONTROL - Select to verify the Ceph status after the migration of each OSD disk.

    • PER_OSD_HOST_CONTROL - Select to verify the Ceph status after the migration of each whole OSD host.

    • CLUSTER_FLAGS - Add a comma-separated list of flags to apply during the migration procedure. The pipeline has been tested with this parameter left blank.

    • ORIGIN_BACKEND - Specify the Ceph back end used before the migration.

    Note

    The PER_OSD_CONTROL and PER_OSD_HOST_CONTROL options provide granular control during the migration to verify each OSD disk after its migration. You can decide to continue or abort.

  5. Click Deploy.

The Ceph - backend migration pipeline workflow:

  1. Set back-end migration flags.

  2. Perform the following for each targeted OSD disk:

    1. Mark the Ceph OSD as out.

    2. Stop the Ceph OSD service.

    3. Remove the Ceph OSD authentication key.

    4. Remove the Ceph OSD from the Ceph cluster.

    5. Remove block_db, block_wal, or journal of the OSD.

  3. Run the ceph.osd state to deploy the OSD with a desired back end.

  4. Unset the back-end migration flags.

Note

During the pipeline execution, a check is performed to verify whether the back end type for an OSD disk differs from the one specified in ORIGIN_BACKEND. If the back end differs, Jenkins does not apply any changes to that OSD disk.

Migrate the management of a Ceph cluster

You can migrate the management of an existing Ceph cluster deployed by Decapod to a cluster managed by the Ceph Salt formula.

To migrate the management of a Ceph cluster:

  1. Log in to the Decapod web UI.

  2. Navigate to the CONFIGURATIONS tab.

  3. Select the required configuration and click VIEW.

  4. Generate a new cluster model with Ceph as described in MCP Deployment Guide: Create a deployment metadata model using the Model Designer. Verify that you fill in the correct values from the Decapod configuration file displayed in the VIEW tab of the Decapod web UI.

  5. In the <cluster_name>/ceph/setup.yml file, specify the right pools and parameters for the existing pools.

    Note

    Verify that the keyring names and their caps match the ones that already exist in the Ceph cluster deployed by Decapod.

  6. In the <cluster_name>/infra/config.yml file, add the following pillar and modify the parameters according to your environment:

    ceph:
      decapod:
        ip: 192.168.1.10
        user: user
        pass: psswd
        deploy_config_name: ceph
    
  7. On the node defined in the previous step, apply the following state:

    salt-call state.sls ceph.migration
    

    Note

    The output of this state must contain defined configurations, Ceph OSD disks, Ceph File System ID (FSID), and so on.

  8. Using the output of the previous command, add the following pillars to your cluster model:

    1. Add the ceph:common pillar to <cluster_name>/ceph/common.yml.

    2. Add the ceph:osd pillar to <cluster_name>/ceph/osd.yml.

  9. Examine the newly generated cluster model for any occurrence of the ceph keyword and verify that it exists in your current cluster model.

  10. Examine each Ceph cluster file to verify that the parameters match the configuration specified in Decapod.

  11. Copy the Ceph cluster directory to the existing cluster model.

  12. Verify that the ceph subdirectory is included in your cluster model in <cluster_name>/infra/init.yml or <cluster_name>/init.yml for older cluster model versions:

    classes:
    - cluster.<cluster_name>.ceph
    
  13. Add the Reclass storage nodes to <cluster_name>/infra/config.yml and change the count variable to the number of OSDs you have. For example:

    classes:
    - system.reclass.storage.system.ceph_mon_cluster
    - system.reclass.storage.system.ceph_rgw_cluster # Add this line only if
    # RadosGW services run on separate nodes from the Ceph Monitor services.
    parameters:
      reclass:
        storage:
          node:
            ceph_osd_rack01:
              name: ${_param:ceph_osd_rack01_hostname}<<count>>
              domain: ${_param:cluster_domain}
              classes:
                - cluster.${_param:cluster_name}.ceph.osd
              repeat:
                count: 3
                start: 1
                digits: 3
                params:
                  single_address:
                    value: ${_param:ceph_osd_rack01_single_subnet}.<<count>>
                    start: 201
                  backend_address:
                    value: ${_param:ceph_osd_rack01_backend_subnet}.<<count>>
                    start: 201
    
  14. If the Ceph RADOS Gateway service is running on the same nodes as the Ceph monitor services:

    1. Add the following snippet to <cluster_name>/infra/config.yml:

      reclass:
        storage:
          node:
            ceph_mon_node01:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
            ceph_mon_node02:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
            ceph_mon_node03:
              classes:
              - cluster.${_param:cluster_name}.ceph.rgw
      
    2. Verify that the parameters in <cluster_name>/ceph/rgw.yml are defined correctly according to the existing Ceph cluster.

  15. From the Salt Master node, generate the Ceph nodes:

    salt-call state.sls reclass
    
  16. Run the commands below.

    Warning

    If the outputs of the commands below contain any changes that can potentially break the cluster, change the cluster model as needed and optionally run the salt-call pillar.data ceph command to verify that the Salt pillar contains the correct value. Proceed to the next step only once you are sure that your model is correct.

    • From the Ceph monitor nodes:

      salt-call state.sls ceph test=True
      
    • From the Ceph OSD nodes:

      salt-call state.sls ceph test=True
      
    • From the Ceph RADOS Gateway nodes:

      salt-call state.sls ceph test=True
      
    • From the Salt Master node:

      salt -C 'I@ceph:common' state.sls ceph test=True
      
  17. Once you have verified that no changes by the Salt Formula can break the running Ceph cluster, run the following commands.

    • From the Salt Master node:

      salt -C 'I@ceph:common:keyring:admin' state.sls ceph.mon
      salt -C 'I@ceph:mon' saltutil.sync_grains
      salt -C 'I@ceph:mon' mine.update
      salt -C 'I@ceph:mon' state.sls ceph.mon
      
    • From one of the OSD nodes:

      salt-call state.sls ceph.osd
      

      Note

      Before you proceed, verify that the OSDs on this node are working fine.

    • From the Salt Master node:

      salt -C 'I@ceph:osd' state.sls ceph.osd
      
    • From the Salt Master node:

      salt -C 'I@ceph:radosgw' state.sls ceph.radosgw
      

Enable RBD monitoring

Warning

This feature is available as technical preview starting from the MCP 2019.2.10 maintenance update and requires Ceph Nautilus. Use such configuration for testing and evaluation purposes only. Before using the feature, follow the steps described in Apply maintenance updates.

If required, you can enable RADOS Block Device (RBD) images monitoring introduced with Ceph Nautilus. Once done, you can view RBD metrics using the Ceph RBD Overview Grafana dashboard. For details, see Ceph dashboards.

To enable RBD monitoring:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/ceph/setup.yml, add the rbd_stats flag for pools serving RBD images to enable serving RBD metrics:

    parameters:
      ceph:
        setup:
          pool:
            <pool_name>:
              pg_num: 8
              pgp_num: 8
              type: replicated
              application: rbd
              rbd_stats: True
    
  3. In classes/cluster/<cluster_name>/ceph/common.yml, set the rbd_monitoring_enabled parameter to True to enable the Ceph RBD Overview Grafana dashboard:

    ceph:
      common:
        public_network: 10.13.0.0/16
        cluster_network: 10.12.0.0/16
        rbd_monitoring_enabled: True
    
  4. Log in to the Salt Master node.

  5. Apply the changes:

    salt "*" saltutil.refresh_pillar
    salt "*" state.apply salt.minion.grains
    salt "*" saltutil.refresh_grains
    salt -C "I@ceph:mgr" state.apply 'ceph.mgr'
    salt -C 'I@grafana:client' state.apply 'grafana.client'
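
Once done, you can check that RBD metrics are exposed by the Ceph Manager Prometheus endpoint. The following is a sketch; port 9283 is the default port of the Ceph Prometheus module, while the address and metric names are assumptions that may vary between Ceph releases:

  # <cmn01_address> is a placeholder; 9283 is the default Ceph Prometheus module port
  curl -s http://<cmn01_address>:9283/metrics | grep ceph_rbd

The output should contain per-image RBD counters, such as read and write operations, for the pools that have rbd_stats enabled.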
    

Glance operations

This section describes the OpenStack Image service (Glance) operations you may need to perform after the deployment of an MCP cluster.

Enable uploading of an image through Horizon with self-managed SSL certificates

By default, the OpenStack Dashboard (Horizon) supports direct uploading of images to Glance. However, if an MCP cluster is deployed using self-signed certificates for the public API endpoints and Horizon, uploading of images to Glance through the Horizon web UI may fail. When you access the Horizon web UI of such an MCP deployment for the first time, a warning informs you that the site is insecure and you must explicitly trust its certificate. However, when you try to upload an image directly from the web browser, the certificate of the Glance API is still not considered trusted by the web browser because the host:port of that site is different. In this case, you must explicitly trust the certificate of the Glance API as well.

To enable uploading of an image through Horizon with self-managed SSL certificates:

  1. Navigate to the Horizon web UI.

  2. On the page that opens, configure your web browser to trust the Horizon certificate if you have not done so yet:

    • In Google Chrome or Chromium, click Advanced > Proceed to <URL> (unsafe).

    • In Mozilla Firefox, navigate to Advanced > Add Exception, enter the URL in the Location field, and click Confirm Security Exception.

    Note

    For other web browsers, the steps may vary slightly.

  3. Navigate to Project > API Access.

  4. Copy the Service Endpoint URL of the Image service.

  5. Open this URL in a new window or tab of the same web browser.

  6. Configure your web browser to trust the certificate of this site as described in step 2.

    As a result, the version discovery document should appear with contents depending on the OpenStack version. For example, for OpenStack Ocata:

    {"versions": [{"status": "CURRENT", "id": "v2.5", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.4", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.3", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.2", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.1", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}, \
    {"status": "SUPPORTED", "id": "v2.0", "links": \
    [{"href": "http://cloud-cz.bud.mirantis.net:9292/v2/", "rel": "self"}]}]}
    

Once done, you should be able to upload an image through Horizon with self-managed SSL certificates.

Telemetry operations

This section describes the Tenant Telemetry service (Ceilometer) operations you may need to perform after the deployment of an MCP cluster.

Enable the Gnocchi archive policies in Tenant Telemetry

The Gnocchi archive policies allow you to define the aggregation and storage policies for metrics received from Ceilometer.

Each archive policy definition is set as the number of points over a timespan. The default archive policy contains two definitions and one rule. It allows you to store metrics for seven days with a granularity of one minute (10080 points) and for 365 days with a granularity of one hour (8760 points). It is applied to any metric sent to Gnocchi with the metric pattern *. You can customize all parameters on the cluster level of your Reclass model.

To enable the Gnocchi archive policies:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In /openstack/telemetry.yml, verify that the following class is present:

    classes:
    ...
    - system.ceilometer.server.backend.gnocchi
    
  3. In /openstack/control/init.yml, add the following classes:

    classes:
    ...
    - system.gnocchi.client
    - system.gnocchi.client.v1.archive_policy.default
    

    The parameters of system.gnocchi.client.v1.archive_policy.default are as follows:

    parameters:
      _param:
        gnocchi_default_policy_granularity_1: '0:01:00'
        gnocchi_default_policy_points_1: 10080
        gnocchi_default_policy_timespan_1: '7 days'
        gnocchi_default_policy_granularity_2: '1:00:00'
        gnocchi_default_policy_points_2: 8760
        gnocchi_default_policy_timespan_2: '365 days'
        gnocchi_default_policy_rule_metric_pattern: '"*"'
      gnocchi:
        client:
          resources:
            v1:
              enabled: true
              cloud_name: 'admin_identity'
              archive_policies:
                default:
                  definition:
                    - granularity: "${_param:gnocchi_default_policy_granularity_1}"
                      points: "${_param:gnocchi_default_policy_points_1}"
                      timespan: "${_param:gnocchi_default_policy_timespan_1}"
                    - granularity: "${_param:gnocchi_default_policy_granularity_2}"
                      points: "${_param:gnocchi_default_policy_points_2}"
                      timespan: "${_param:gnocchi_default_policy_timespan_2}"
                  rules:
                    default:
                      metric_pattern: "${_param:gnocchi_default_policy_rule_metric_pattern}"
    
  4. Optional. Specify additional archive policies as required. For example, to aggregate the CPU and disk-related metrics with a timespan of 30 days and a granularity of one second (2592000 points), add the following parameters to /openstack/control/init.yml under the default Gnocchi archive policy parameters:

    parameters:
      _param:
      ...
      gnocchi:
        client:
          resources:
            v1:
              enabled: true
              cloud_name: 'admin_identity'
              archive_policies:
                default:
                ...
                cpu_disk_policy:
                  definition:
                    - granularity: '0:00:01'
                      points: 2592000
                      timespan: '30 days'
                  rules:
                    cpu_rule:
                      metric_pattern: 'cpu*'
                    disk_rule:
                      metric_pattern: 'disk*'
    

    Caution

    Rule names defined across archive policies must be unique.

  5. Log in to the Salt Master node.

  6. Apply the following states:

    salt -C 'I@gnocchi:client and *01*' saltutil.pillar_refresh
    salt -C 'I@gnocchi:client and *01*' state.sls gnocchi.client
    salt -C 'I@gnocchi:client' state.sls gnocchi.client
    
  7. Verify that the archive policies are set successfully:

    1. Log in to any OpenStack controller node.

    2. Boot a test VM:

      source keystonercv3
      openstack server create --flavor <flavor_id> \
      --nic net-id=<net_id> --image <image_id>  test_vm1
      
    3. Run the following command:

      openstack metric list | grep <vm_id>
      

      Use the vm_id parameter value from the output of the command that you run in the previous step.

      Example of system response extract:

      +---------+-------------------+-------------------------------+------+-----------+
      | id      |archive_policy/name| name                          | unit |resource_id|
      +---------+-------------------+-------------------------------+------+-----------+
      | 0ace... | cpu_disk_policy   | disk.allocation               | B    | d9011...  |
      | 0ca6... | default           | perf.instructions             | None | d9011...  |
      | 0fcb... | default           | compute.instance.booting.time | sec  | d9011...  |
      | 10f0... | cpu_disk_policy   | cpu_l3_cache                  | None | d9011...  |
      | 2392... | default           | memory                        | MB   | d9011...  |
      | 2395... | cpu_disk_policy   | cpu_util                      | %    | d9011...  |
      | 26a0... | default           | perf.cache.references         | None | d9011...  |
      | 367e... | cpu_disk_policy   | disk.read.bytes.rate          | B/s  | d9011...  |
      | 3857... | default           | memory.bandwidth.total        | None | d9011...  |
      | 3bb2... | default           | memory.usage                  | None | d9011...  |
      | 4288... | cpu_disk_policy   | cpu                           | ns   | d9011...  |
      +---------+-------------------+-------------------------------+------+-----------+
      

    In the example output above, all metrics are aggregated using the default archive policy except for the CPU and disk metrics aggregated by cpu_disk_policy. The cpu_disk_policy parameters were previously customized in the Reclass model.

Add availability zone to Gnocchi instance resource

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to add availability zones to Gnocchi instance resources and consume the instance.create.end events.

To add an availability zone to a Gnocchi instance resource:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In /openstack/telemetry.yml, set the create_resources parameter to True:

    ceilometer:
      server:
        publisher:
          gnocchi:
            enabled: True
            create_resources: True
    
  3. From the Salt Master node, apply the following state:

    salt -C 'I@ceilometer:server' saltutil.refresh_pillar
    salt -C 'I@ceilometer:server' state.apply ceilometer.server
    

Migrate from GlusterFS to rsync for fernet and credential keys rotation

By default, the latest MCP deployments use rsync for fernet and credential keys rotation. However, if your MCP version is 2018.8.0 or earlier, GlusterFS is used as the default rotation driver for both fernet and credential keys. This section provides an instruction on how to configure your MCP OpenStack deployment to use rsync with SSH instead of GlusterFS.

To migrate from GlusterFS to rsync:

  1. Log in to the Salt Master node.

  2. On the system level, verify that the following class is included in keystone/server/cluster.yml:

    - system.keystone.server.fernet_rotation.cluster
    

    Note

    The default configuration for the system.keystone.server.fernet_rotation.cluster class is defined in keystone/server/fernet_rotation/cluster.yml. It includes the default list of nodes to synchronize fernet and credential keys that are sync_node01 and sync_node02. If there are more nodes to synchronize fernet and credential keys, expand this list as required.

  3. Verify that the crontab job is disabled in the keystone/client/core.yml and keystone/client/single.yml system-level files:

    linux:
      system:
        job:
          keystone_job_rotate:
            command: '/usr/bin/keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone >> /var/log/key_rotation_log 2>> /var/log/key_rotation_log'
            enabled: false
            user: root
            minute: 0
    
  4. Apply the Salt orchestration state to configure all required prerequisites, such as creating an SSH public key and uploading it to the Salt mine and to the secondary control nodes:

    salt-run state.orchestrate keystone.orchestrate.deploy
    
  5. Apply the keystone.server state to install the Keystone rotation script and run it in the sync mode so that the fernet and credential keys are synchronized with the secondary Keystone nodes:

    salt -C 'I@keystone:server:role:primary' state.apply keystone.server
    salt -C 'I@keystone:server' state.apply keystone.server
    
  6. Apply the linux.system state to add crontab jobs for the Keystone user:

    salt -C 'I@keystone:server' state.apply linux.system
    
  7. On all OpenStack Controller nodes:

    1. Copy the current credential and fernet keys to temporary directories:

      mkdir /tmp/keystone_credential /tmp/keystone_fernet
      cp /var/lib/keystone/credential-keys/* /tmp/keystone_credential
      cp /var/lib/keystone/fernet-keys/* /tmp/keystone_fernet
      
    2. Unmount the related GlusterFS mount points:

      umount /var/lib/keystone/credential-keys
      umount /var/lib/keystone/fernet-keys
      
    3. Copy the keys from the temporary directories to /var/lib/keystone/credential-keys/ and /var/lib/keystone/fernet-keys/:

      mkdir -p /var/lib/keystone/credential-keys/ /var/lib/keystone/fernet-keys/
      cp /tmp/keystone_credential/* /var/lib/keystone/credential-keys/
      cp /tmp/keystone_fernet/* /var/lib/keystone/fernet-keys/
      chown -R keystone:keystone /var/lib/keystone/credential-keys/*
      chown -R keystone:keystone /var/lib/keystone/fernet-keys/*
      
  8. On a KVM node, stop and delete the keystone-credential-keys and keystone-keys volumes:

    1. Stop the volumes:

      gluster volume stop keystone-credential-keys
      gluster volume stop keystone-keys
      
    2. Delete the GlusterFS volumes:

      gluster volume delete keystone-credential-keys
      gluster volume delete keystone-keys
      
  9. On the cluster level model, remove the following GlusterFS classes included in the openstack/control.yml file by default:

    - system.glusterfs.server.volume.keystone
    - system.glusterfs.client.volume.keystone
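
Once done, you can verify that the rotation is driven by cron and that the keys stay synchronized across the controller nodes. The following is a sketch; run it from the Salt Master node:

  # keys must have identical checksums on all Keystone nodes
  salt -C 'I@keystone:server' cmd.run 'crontab -l -u keystone'
  salt -C 'I@keystone:server' cmd.run 'md5sum /var/lib/keystone/fernet-keys/*'

The crontab output should contain the rotation job, and the checksums of the fernet keys must match on all Keystone nodes.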
    

Disable the Memcached listener on the UDP port

Starting from the Q4’18 MCP release, to reduce the attack surface and increase the product security, Memcached on the controller nodes listens on TCP only. The UDP port for Memcached is disabled by default. This section explains how to disable the UDP listeners for the existing OpenStack environments deployed on top of the earlier MCP versions.

To disable the Memcached listener on the UDP port:

  1. Log in to the Salt Master node.

  2. Update your Reclass metadata model.

  3. Verify the memcached:server pillar:

    salt ctl01* pillar.get memcached:server
    

    After the update of the Reclass metadata model, the memcached:server:bind:proto pillar should be available, and proto:udp:enabled should be set to False for all Memcached server instances.

    Example of system response:

    -- start output --
      ----------
      bind:
          ----------
          address:
              0.0.0.0
          port:
              11211
          proto:
              ----------
              tcp:
                  ----------
                  enabled:
                      True
              udp:
                  ----------
                  enabled:
                      False
          protocol:
              tcp
      enabled:
          True
      maxconn:
          8192
    -- end output --
    
  4. Run the memcached.server state to apply the changes to all memcached instances:

    salt -C 'I@memcached:server' state.sls memcached.server
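
You can additionally confirm that Memcached no longer listens on the UDP port. The following is a sketch; run it from the Salt Master node:

  # expect no UDP listener on port 11211
  salt -C 'I@memcached:server' cmd.run 'ss -lnu | grep 11211 || echo "no UDP listener"'

The command should report no UDP listener on port 11211, while the TCP listener on the same port remains available.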
    

Configuring rate limiting with NGINX

MCP enables you to limit the number of HTTP requests that a user can make in a given period of time for your OpenStack deployments. The rate limiting with NGINX can be used to protect an OpenStack environment against DDoS attacks as well as to protect the upstream application servers from being overwhelmed by too many user requests at the same time.

For rate limiting configuration, MCP supports the following NGINX modules:

  • ngx_http_geo_module

  • ngx_http_map_module

  • ngx_http_limit_req_module

  • ngx_http_limit_conn_module

This section provides the related NGINX directives description with the configuration samples which you can use to enable rate limiting in your MCP OpenStack deployment.

NGINX rate limiting configuration sample

This section includes the configuration sample of NGINX rate limiting feature that enables you to limit the number of HTTP requests a user can make in a given period of time.

In the sample, all clients except for 10.12.100.1 are limited to 1 request per second. More specifically, the sample illustrates how to:

  • Create a geo instance that matches the client IP address and sets the ip_limit_key variable, where 0 stands for unlimited and 1 stands for limited.

  • Create global_geo_limiting_map that will map ip_limit_key to ip_limit_action.

  • Create a global limit_req_zone zone called global_limit_zone that limits the number of requests to 1 request per second.

  • Apply global_limit_zone globally to all requests with 5 requests burst and nodelay.

Configuration sample:

nginx:
  server:
    enabled: true
    geo:
      enabled: true
      items:
        global_geo_limiting:
          enabled: true
          variable: ip_limit_key
          body:
            default:
              value: '1'
            unlimited_client1:
              name: '10.12.100.1/32'
              value: '0'
    map:
      enabled: true
      items:
        global_geo_limiting_map:
          enabled: true
          string: ip_limit_key
          variable: ip_limit_action
          body:
            limited:
              name: 1
              value: '$binary_remote_addr'
            unlimited:
              name: 0
              value: '""'
    limit_req_module:
      limit_req_zone:
        global_limit_zone:
          key: ip_limit_action
          size: 10m
          rate: '1r/s'
      limit_req_status: 503
      limit_req:
         global_limit_zone:
           burst: 5
           enabled: true

To apply the request limiting to a particular site, define the limit_req on a site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_req_module:
          limit_req:
            global_limit_zone:
              burst: 5
              nodelay: true
              enabled: true
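
After applying the changes with the nginx state, you can roughly verify the behavior by sending a burst of requests to the rate-limited endpoint. The following loop is only a sketch: the proxy VIP, port, and URI are placeholders that depend on your deployment, and once the rate and burst are exceeded, the responses are expected to return the configured limit_req_status code (503 in the sample above):

for i in $(seq 1 20); do
  curl -k -s -o /dev/null -w "%{http_code}\n" https://<proxy_vip>:5000/v3/
done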

Configuring the geo module

The ngx_http_geo_module module creates variables with values depending on the client IP address.

Syntax

geo [$address] $variable { ... }

Default

Context

HTTP

NGINX configuration sample

geo $my_geo_map {
    default        0;
    127.0.0.1      0;
    10.12.100.1/32 1;
    10.13.0.0/16    2;
    2001:0db8::/32 1;
}

Example of a Salt pillar for the geo module:

nginx:
  server:
    geo:
      enabled: true
      items:
        my_geo_map:
          enabled: true
          variable: my_geo_map_variable
          body:
            default:
              value: '0'
            localhost:
              name: 127.0.0.1
              value: '0'
            client:
              name: 10.12.100.1/32
              value: '1'
            network:
              name: 10.13.0.0/16
              value: '2'
            ipv6_client:
              name: 2001:0db8::/32
              value: '1'

After you apply the nginx.server state, all geo variables specified in the pillars are reflected in the /etc/nginx/conf.d/geo.conf file.
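
For example, to apply the pillar changes and review the rendered file, you can run the following commands from the Salt Master node. The target expression is an assumption that matches all nodes with the nginx:server pillar:

salt -C 'I@nginx:server' state.sls nginx.server
salt -C 'I@nginx:server' cmd.run 'cat /etc/nginx/conf.d/geo.conf'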

Configuring the mapping

The ngx_http_map_module module creates variables whose values depend on the values of other (source) variables specified in the first parameter.

Syntax

map string $variable { ... }

Default

Context

HTTP

NGINX configuration sample

map $my_geo_map_variable $ip_limit_action {
    default "";
    1 $binary_remote_addr;
    0 "";
}

Example of a Salt pillar for the map module:

nginx:
  server:
    map:
      enabled: true
      items:
        global_geo_limiting_map:
          enabled: true
          string: my_geo_map_variable
          variable: ip_limit_action
          body:
            default:
              value: '""'
            limited:
              name: '1'
              value: '$binary_remote_addr'
            unlimited:
              name: '0'
              value: '""'

After you apply the nginx.server state, all map variables specified in the pillars are reflected in the /etc/nginx/conf.d/map.conf file.

Configuring the request limiting

The ngx_http_limit_req_module module limits the request processing rate per a defined key. The module directives include the mandatory limit_req_zone and limit_req directives and an optional limit_req_status directive.

The limit_req_zone directive defines the parameters for the rate limiting.

Syntax

limit_req_zone key zone=name:size rate=rate [sync];

Default

Context

HTTP

NGINX configuration sample

limit_req_zone $binary_remote_addr zone=global_limit_zone1:10m rate=1r/s ;
limit_req_zone $ip_limit_action zone=global_limit_zone2:10m rate=2r/s ;

The limit_req directive enables rate limiting within the context where it appears.

Syntax

limit_req zone=name [burst=number] [nodelay | delay=number];

Default

Context

HTTP, server, location

NGINX configuration sample

limit_req zone=global_limit_zone1 burst=2 ;
limit_req zone=global_limit_zone2 burst=4 nodelay ;

The limit_req_status directive sets the status code to return in response to rejected requests.

Syntax

limit_req_status code;

Default

limit_req_status 503;

Context

http, server, location that corresponds to the nginx:server and nginx:server:site definitions of a pillar.

NGINX configuration sample

limit_req_status 429;

Example of a Salt pillar for limit_req_zone and limit_req:

nginx:
  server:
    limit_req_module:
      limit_req_zone:
        global_limit_zone1:
          key: binary_remote_addr
          size: 10m
          rate: '1r/s'
        global_limit_zone2:
          key: ip_limit_action
          size: 10m
          rate: '2r/s'
      limit_req_status: 429
      limit_req:
        global_limit_zone1:
          burst: 2
          enabled: true
        global_limit_zone2:
          burst: 4
          enabled: true
          nodelay: true

In the configuration example above, the states are kept in the 10-megabyte global_limit_zone1 and global_limit_zone2 zones. The average request processing rate cannot exceed 1 request per second for global_limit_zone1 and 2 requests per second for global_limit_zone2.

The $binary_remote_addr variable, a client's IP address, serves as the key for the global_limit_zone1 zone, and the mapped $ip_limit_action variable is the key for the global_limit_zone2 zone.

To apply the request limiting to a particular site, define the limit_req on a site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_req_module:
          limit_req:
            global_limit_zone:
              burst: 5
              nodelay: true
              enabled: true

Configuring the connection limiting

The ngx_http_limit_conn_module module limits the number of connections per defined key. The main directives include limit_conn_zone and limit_conn.

The limit_conn_zone directive sets parameters for a shared memory zone that keeps states for various keys. A state is the current number of connections. The key value can contain text, variables, and their combination. The requests with an empty key value are not accounted.

Syntax

limit_conn_zone key zone=name:size;

Default

Context

HTTP

NGINX configuration sample

limit_conn_zone $binary_remote_addr zone=global_limit_conn_zone:20m;
limit_conn_zone $binary_remote_addr zone=openstack_web_conn_zone:10m;

The limit_conn directive sets the shared memory zone and the maximum allowed number of connections for a given key value. When this limit is exceeded, the server returns the error in reply to a request.

Syntax

limit_conn zone number;

Default

Context

HTTP, server, location

NGINX configuration sample

limit_conn global_limit_conn_zone 100;
limit_conn_status 429;

Example of a Salt pillar with limit_conn_zone and limit_conn:

nginx:
  server:
    limit_conn_module:
      limit_conn_zone:
        global_limit_conn_zone:
          key: 'binary_remote_addr'
          size: 20m
          enabled: true
        api_keystone_conn_zone:
          key: 'binary_remote_addr'
          size: 10m
          enabled: true
      limit_conn:
        global_limit_conn_zone:
          connections: 100
          enabled: true
      limit_conn_status: 429

To apply the connection limiting to a particular site, define limit_conn on a site level. For example:

nginx:
  server:
    site:
      nginx_proxy_openstack_api_keystone:
        limit_conn_module:
          limit_conn_status: 429
          limit_conn:
            api_keystone_conn_zone:
              connections: 50
              enabled: true
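
To roughly verify the connection limiting after the changes are applied, you can open many concurrent connections against the site. The example below is only a sketch: it assumes that the ab tool (apache2-utils) is installed on the test host, the VIP and port are placeholders, and with a concurrency above the configured limit the Non-2xx responses counter grows as NGINX returns the limit_conn_status code (429 in the sample above):

ab -n 1000 -c 100 https://<proxy_vip>:5000/v3/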

Configure load balancing for Horizon

Starting from the Q4’18 MCP version, Horizon works in the load balancing mode by default. All requests to Horizon are terminated and forwarded to the Horizon back end by HAProxy bound on a virtual IP address. HAProxy serves as a balancer and distributes requests among the proxy nodes according to the defined policy, which is round-robin by default. This approach reduces the load on a single proxy node by spreading it across all proxy nodes.

Note

If the node, which the user is connected to, has failed and the user is reconnected to another node, the user will be logged out from the dashboard. As a result, the The user is not authorized page opens, which is the expected behavior in this use case. To continue working with the dashboard, the user has to sign in to Horizon again from the Log In page.

This section provides the instruction on how to manually configure Horizon load balancing for the existing OpenStack deployments that are based on earlier MCP release versions.

To enable active-active mode for Horizon:

  1. Log in to the Salt Master node.

  2. Update to the 2019.2.0 Build ID MCP version or higher.

  3. Verify that the system.apache.server.site.horizon class has been added to your Reclass model. By default, the class is defined in the ./system/apache/server/site/horizon.yml file on the Reclass system level as follows:

    parameters:
      _param:
        apache_ssl:
          enabled: false
        apache_horizon_ssl: ${_param:apache_ssl}
        apache_horizon_api_address: ${_param:horizon_server_bind_address}
        apache_horizon_api_host: ${linux:network:fqdn}
      apache:
        server:
          bind:
            listen_default_ports: false
          enabled: true
          default_mpm: event
          modules:
            - wsgi
          site:
            horizon:
              enabled: false
              available: true
              type: wsgi
              name: openstack_web
              ssl: ${_param:apache_horizon_ssl}
              wsgi:
                daemon_process: horizon
                processes: 3
                threads: 10
                user: horizon
                group: horizon
                display_name: '%{GROUP}'
                script_alias: '/ /usr/share/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi'
                application_group: '%{GLOBAL}'
                authorization: 'On'
              limits:
                request_body: 0
              host:
                address: ${_param:apache_horizon_api_address}
                name: ${_param:apache_horizon_api_host}
                port: 8078
              locations:
                - uri: /static
                  path: /usr/share/openstack-dashboard/static
              directories:
                dashboard_static:
                  path: /usr/share/openstack-dashboard/static
                  order: 'allow,deny'
                  allow: 'from all'
                  modules:
                    mod_expires.c:
                      ExpiresActive: 'On'
                      ExpiresDefault: '"access 6 month"'
                    mod_deflate.c:
                      SetOutputFilter: 'DEFLATE'
                dashboard_wsgi:
                  path: /usr/share/openstack-dashboard/openstack_dashboard/wsgi
                  order: 'allow,deny'
                  allow: 'from all'
              log:
                custom:
                  format: >-
                    %v:%p %{X-Forwarded-For}i %h %l %u %t \"%r\" %>s %D %O \"%{Referer}i\" \"%{User-Agent}i\"
                error:
                  enabled: true
                  level: debug
                  format: '%M'
                  file: '/var/log/apache2/openstack_dashboard_error.log'
    
  4. Verify that the system.apache.server.site.horizon has been added to the Reclass system level in the ./system/horizon/server/single.yml file as follows:

    classes:
      - service.horizon.server.single
      - system.horizon.upgrade
      - system.horizon.server.iptables
      - system.apache.server.single
      - system.memcached.server.single
      - system.apache.server.site.horizon
    
  5. Verify that the definition for the system.haproxy.proxy.listen.openstack.openstack_web class has been added to the Reclass cluster level in the proxy nodes configuration file:

    parameters:
      _param:
        haproxy_openstack_web_check_params: check
      haproxy:
        proxy:
          listen:
            openstack_web:
              type: custom
              check: false
              sticks: ${_param:haproxy_openstack_web_sticks_params}
              binds:
              - address: ${_param:cluster_vip_address}
                port: ${_param:haproxy_openstack_web_bind_port}
              servers:
              - name: ${_param:cluster_node01_hostname}
                host: ${_param:cluster_node01_address}
                port: 8078
                params: ${_param:haproxy_openstack_web_check_params}
              - name: ${_param:cluster_node02_hostname}
                host: ${_param:cluster_node02_address}
                port: 8078
                params: ${_param:haproxy_openstack_web_check_params}
    
  6. Add the system.haproxy.proxy.listen.openstack.openstack_web class to the Horizon node configuration file, for example, cluster/<cluster_name>/openstack/dashboard.yml:

    classes:
      - system.haproxy.proxy.listen.openstack.openstack_web
    
  7. In the Horizon node configuration file (edited in the previous step), define the host names and IP addresses for all proxy nodes used in the deployment for the dashboard node and verify that the HAProxy checks the availability of Horizon.

    Configuration example for two proxy nodes:

    parameters:
      _param:
        cluster_node01_hostname: ${_param:openstack_proxy_node01_hostname}
        cluster_node01_address: ${_param:openstack_proxy_node01_address}
        cluster_node02_hostname: ${_param:openstack_proxy_node02_hostname}
        cluster_node02_address: ${_param:openstack_proxy_node02_address}
        haproxy_openstack_web_bind_port: ${_param:horizon_public_port}
        haproxy_openstack_web_check_params: check inter 10s fastinter 2s downinter 3s rise 3 fall 3 check-ssl verify none
      horizon:
        server:
          cache:
            ~members:
            - host: ${_param:openstack_proxy_node01_address}
              port: 11211
            - host: ${_param:openstack_proxy_node02_address}
              port: 11211
    
  8. If the HTTP to HTTPS redirection will be used, add the following configuration to the Horizon node configuration file:

    parameters:
      haproxy:
        proxy:
          listen:
            openstack_web_proxy:
              mode: http
              format: end
              force_ssl: true
              binds:
              - address: ${_param:cluster_vip_address}
                port: 80
    
  9. Disable the NGINX servers requests for Horizon by replacing the NGINX class with the HAProxy class in the proxy node configuration file.

    Replace:

    - system.nginx.server.proxy.openstack_web
    

    with

    - system.haproxy.proxy.single
    
  10. Remove the nginx_redirect_openstack_web_redirect.conf and nginx_proxy_openstack_web.conf Horizon sites from /etc/nginx/sites-enabled/.

  11. Restart the NGINX service on the proxy nodes:

    salt 'prx*' cmd.run 'systemctl restart nginx'
    
  12. Verify that Keepalived keeps track of HAProxy by adding the haproxy variable to the keepalived_vrrp_script_check_multiple_processes parameter:

    parameters:
      _param:
        keepalived_vrrp_script_check_multiple_processes: 'nginx haproxy'
    
  13. Enable SSL for Horizon:

    parameters:
      _param:
        apache_ssl:
          enabled: true
          authority: ${_param:salt_minion_ca_authority}
          engine: salt
          key_file:  /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}.key
          cert_file: /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}.crt
          chain_file: /srv/salt/pki/${_param:cluster_name}/${salt:minion:cert:proxy:common_name}-with-chain.crt
    
  14. Define the address to be bound by Memcached in the cluster/<cluster_name>/openstack/proxy.yml file:

    parameters:
      _param:
        openstack_memcached_server_bind_address: ${_param:single_address}
    
  15. Verify that the Horizon Salt formula is updated to a version higher than 2016.12.1+201812072002.e40b950 and the Apache Salt formula is updated to a version higher than 0.2+201811301717.acb3391.
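
    For example, you can check the installed formula versions on the Salt Master node. The check below is a sketch that assumes the formulas are installed as Debian packages, which is the default for MCP:

    dpkg -l | egrep 'salt-formula-(horizon|apache)'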

  16. Delete the NGINX sites from the proxy nodes that proxy Horizon requests and possible redirection from HTTP to HTTPS.

  17. Apply the haproxy and horizon states on the proxy nodes:

    salt -C 'I@horizon:server' state.sls horizon
    salt -C 'I@horizon:server' state.sls haproxy
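
    Optionally, verify the result. The commands below are a sketch: the HAProxy configuration check runs on the proxy nodes, and the curl check assumes that Horizon is published on the cluster VIP over HTTPS (replace the placeholder with your VIP):

    salt 'prx*' cmd.run 'haproxy -c -f /etc/haproxy/haproxy.cfg'
    curl -k -I https://<cluster_vip_address>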
    

Expose a hardware RNG device to Nova instances

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

MCP enables you to define the path to a Random Number Generator (RNG) device that will be used as the source of entropy on the host. The default source of entropy is /dev/urandom. Other available options include /dev/random and /dev/hwrng.

The example structure of the RNG definition in the Nova pillar:

nova:
  controller:
    libvirt:
      rng_dev_path: /dev/random

  compute:
    libvirt:
      rng_dev_path: /dev/random

The procedure included in this section can be used for both existing and new MCP deployments.

To define the path to an RNG device:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/openstack/control.yml, define the rng_dev_path parameter for nova:controller:

    nova:
      controller:
        libvirt:
          rng_dev_path: /dev/random
    
  3. In classes/cluster/<cluster_name>/openstack/compute/init.yml, define the rng_dev_path parameter for nova:compute:

    nova:
      compute:
        libvirt:
          rng_dev_path: /dev/random
    
  4. Apply the changes:

    salt -C 'I@nova:controller' state.sls nova.controller
    salt -C 'I@nova:compute' state.sls nova.compute
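
    Optionally, verify that the option has been rendered into the Nova configuration. This check is a sketch that assumes the parameter ends up in /etc/nova/nova.conf:

    salt -C 'I@nova:compute' cmd.run 'grep rng_dev_path /etc/nova/nova.conf'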
    

Set the directory for lock files

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can set the directory for lock files for the Ceilometer, Cinder, Designate, Glance, Ironic, Neutron, and Nova OpenStack services by specifying the lock_path parameter in the Reclass model. This section provides the example of the lock path configuration for Nova.

To set the lock path for Nova:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Define the lock_path parameter:

    1. In openstack/control.yml, specify:

      parameters:
        nova:
          controller:
            concurrency:
              lock_path: '/var/lib/nova/tmp'
      
    2. In openstack/compute.yml, specify:

      parameters:
        nova:
          compute:
            concurrency:
              lock_path: '/var/lib/nova/tmp'
      
  3. Apply the changes from the Salt Master node:

    salt -C 'I@nova:controller or I@nova:compute' saltutil.refresh_pillar
    salt -C 'I@nova:controller' state.apply nova.controller
    salt -C 'I@nova:compute' state.apply nova.compute
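
    Optionally, verify that the option has been rendered into the Nova configuration on the target nodes. This check is a sketch that assumes the parameter ends up in /etc/nova/nova.conf:

    salt -C 'I@nova:controller or I@nova:compute' cmd.run 'grep lock_path /etc/nova/nova.conf'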
    

Add the Nova CpuFlagsFilter custom filter

Note

This feature is available starting from the MCP 2019.2.10 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

CpuFlagsFilter is a custom Nova scheduler filter for live migrations. The filter ensures that the CPU features of a live migration source host match the target host. Use the CpuFlagsFilter filter only if your deployment meets the following criteria:

  • The CPU mode is set to host-passthrough or host-model. For details, see MCP Deployment Guide: Configure a CPU model.

  • The OpenStack compute nodes have heterogeneous CPUs.

  • The OpenStack compute nodes are not organized in aggregates with the same CPU in each aggregate.

To add the Nova CpuFlagsFilter custom filter:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Open the classes/cluster/<cluster_name>/openstack/control.yml file for editing.

  3. Verify that the cpu_mode parameter is set to host-passthrough or host-model.

  4. Add CpuFlagsFilter to the scheduler_default_filters parameter for nova:controller:

    nova:
      controller:
        scheduler_default_filters: "DifferentHostFilter,SameHostFilter,RetryFilter,AvailabilityZoneFilter,RamFilter,CoreFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,NUMATopologyFilter,AggregateInstanceExtraSpecsFilter,CpuFlagsFilter"
    
  5. Log in to the Salt Master node.

  6. Apply the changes:

    salt -C 'I@nova:controller' state.sls nova.controller
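
    Optionally, verify that the filter has been added to the Nova scheduler configuration. This check is a sketch that assumes the filter list ends up in /etc/nova/nova.conf on the controller nodes:

    salt -C 'I@nova:controller' cmd.run 'grep CpuFlagsFilter /etc/nova/nova.conf'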
    

Kubernetes operations

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

This section includes topics that describe operations with your Kubernetes environment.

Monitor connectivity between the Kubernetes nodes using Netchecker

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

The Mirantis Cloud Platform automatically deploys Netchecker as part of an MCP Kubernetes Calico-based deployment. Netchecker enables network connectivity and network latency monitoring for the Kubernetes nodes.

This section includes topics that describe how to configure and use Netchecker.

View Netchecker metrics

MCP automatically configures Netchecker during the deployment of the Kubernetes cluster. Therefore, Netchecker starts gathering metrics as soon as the Kubernetes cluster is up and running. You can view Netchecker metrics to troubleshoot connectivity between the Kubernetes nodes.

To view Netchecker metrics:

  1. Log in to the Kubernetes Master node.

  2. Obtain the IP address of the Netchecker server pod:

    kubectl get pod -o json --selector='app==netchecker-server' -n \
    netchecker | grep podIP
    
  3. Obtain the Netchecker container port number:

    kubectl get pod -o json --selector='app==netchecker-server' -n \
    netchecker | grep containerPort
    
  4. View all metrics provided by Netchecker:

    curl <netchecker-pod-ip>:<port>/metrics
    
  5. View the list of Netchecker agents metrics:

    curl <netchecker-pod-ip>:<port>/metrics | grep ncagent
    

    Example of system response:

    # HELP ncagent_error_count_total Total number of errors (keepalive miss
    # count) for the agent.
    # TYPE ncagent_error_count_total counter
    ncagent_error_count_total{agent="cmp01-private_network"} 0
    ncagent_error_count_total{agent="cmp02-private_network"} 0
    ncagent_error_count_total{agent="ctl01-private_network"} 0
    ncagent_error_count_total{agent="ctl02-private_network"} 0
    ncagent_error_count_total{agent="ctl03-private_network"} 0
    
    ...
    

    For the list of Netchecker metrics, see: Netchecker metrics description.

Netchecker metrics description

The following table lists Netchecker metrics. The metrics with the ncagent_ prefix are used to monitor the Kubernetes environment.

Netchecker metrics

Metric

Description

go_*

A set of default golang metrics provided by the Prometheus library.

process_*

A set of default process metrics provided by the Prometheus library.

ncagent_report_count_total (label agent)

A counter that calculates the number of total reports from every Netchecker agent separated by label.

ncagent_error_count_total (label agent)

A counter that calculates the number of total errors from every agent separated by label. Netchecker increases the value of the counter each time the Netchecker agent fails to send a report within the reporting_interval * 2 timeframe.

ncagent_http_probe_connection_result

A gauge that represents the connection result between an HTTP server and a Netchecker agent. Possible values: 0 - error, 1 - success.

ncagent_http_probe_code

A gauge that represents the HTTP status code. Returns 0 if there is no HTTP response.

ncagent_http_probe_total_time_ms

A gauge that represents the total duration of an HTTP transaction.

ncagent_http_probe_content_transfer_time_ms

A gauge that represents the duration of content transfer from the first response byte till the end (in ms).

ncagent_http_probe_tcp_connection_time_ms

A gauge that represents the TCP connection establishing time in ms.

ncagent_http_probe_dns_lookup_time_ms

A gauge that represents the DNS lookup time in ms.

ncagent_http_probe_connect_time_ms

A gauge that represents connection time in ms.

ncagent_http_probe_server_processing_time_ms

A gauge that represents the server processing time in ms.

Transition to containers

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

Transitioning from virtual machines to containers is a lengthy and complex process that in some environments may take years. If you want to leverage Kubernetes features while you continue using the existing applications that run in virtual machines, Mirantis provides an interim solution that enables running virtual machines orchestrated by Kubernetes.

To enable Kubernetes to run virtual machines, you need to deploy and configure a virtual machine runtime for Kubernetes called Virtlet. Virtlet is a Kubernetes Container Runtime Interface (CRI) implementation that is packaged as a Docker image and contains such components as libvirt daemon, QEMU/KVM wrapper, and so on.

Virtlet enables you to run unmodified QEMU/KVM virtual machines that do not include an additional containerd layer as in similar solutions in Kubernetes. Virtlet supports all standard Kubernetes objects, such as ReplicaSets, deployments, DaemonSets, and so on, as well as their operations. For information on operations with Kubernetes objects, see: Kubernetes documentation.

Unmodified QEMU/KVM virtual machines enable you to run:

  • Unikernels

  • Applications that are hard to containerize

  • NFV workloads

  • Legacy applications

Compared to regular Kubernetes pods, Virtlet pods have the following limitations:

  • Only one virtual machine per pod is allowed.

  • Virtual machine volumes (pod volumes) must be specified using the FlexVolume driver. Standard Kubernetes directory-based volumes are not supported except for the use case of Kubernetes Secrets and ConfigMaps. If a Secret or a ConfigMap is mounted to a VM pod, its content is copied into an appropriate location inside the VM using the cloud-init mechanism.

  • No support for kubectl exec.

For details on Virtlet operations, see: Virtlet documentation.

This section describes how to create and configure Virtlet pods as well as provides examples of pods for different services.

For an instruction on how to update Virtlet, see Update Virtlet.

Prerequisites

To be able to run virtual machines as Kubernetes pods, your environment must meet the following prerequisites:

  • An operational Kubernetes environment with enabled Virtlet functionality.

  • SELinux and AppArmor must be disabled on the Kubernetes nodes.

  • The Kubernetes node names must be resolvable by the DNS server configured on the Kubernetes nodes.

Example of a pod configuration

You need to define a pod for each virtual machine that you want to place under Kubernetes orchestration.

Pods are defined as .yaml files. The following text is an example of a pod configuration for a VM with Virtlet:

apiVersion: v1
kind: Pod
metadata:
  name: cirros-vm
  annotations:
    kubernetes.io/target-runtime: virtlet
    VirtletVCPUCount: "1"
    VirtletSSHKeys: |
      ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCaJEcFDXEK2Zb
      X0ZLS1EIYFZRbDAcRfuVjpstSc0De8+sV1aiu+dePxdkuDRwqFt
      Cyk6dEZkssjOkBXtri00MECLkir6FcH3kKOJtbJ6vy3uaJc9w1E
      Ro+wyl6SkAh/+JTJkp7QRXj8oylW5E20LsbnA/dIwWzAF51PPwF
      7A7FtNg9DnwPqMkxFo1Th/buOMKbP5ZA1mmNNtmzbMpMfJATvVy
      iv3ccsSJKOiyQr6UG+j7sc/7jMVz5Xk34Vd0l8GwcB0334MchHc
      kmqDB142h/NCWTr8oLakDNvkfC1YneAfAO41hDkUbxPtVBG5M/o
      7P4fxoqiHEX+ZLfRxDtHB53 me@localhost
    VirtletCloudInitUserDataScript: |
      #!/bin/sh
      echo "Hi there"
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: cirros-vm
    image: virtlet/download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img
    resources:
      limits:
        memory: 128Mi

The following table describes the pod configuration parameters:

Pod definition parameters

Parameter

Description

apiVersion

Version of the Kubernetes API.

kind

Type of the file. For all pod configurations, the kind parameter value is Pod.

metadata

Specifies a number of parameters required for the pod configuration, including:

  • name - the name of the pod.

  • annotations - a subset of metadata parameters in the form of strings. Numeric values must be quoted:

    • kubernetes.io/target-runtime - defines that this pod belongs to the Virtlet runtime.

    • VirtletVCPUCount - (optional) specifies the number of virtual CPUs. The default value is 1.

    • VirtletSSHKeys - one or many SSH keys, one key per line.

    • VirtletCloudInitUserDataScript - user data for the cloud-init script.

spec

Pod specification, including:

  • nodeAffinity - the specification in the example above ensures that Kubernetes runs this pod only on the nodes that have the extraRuntime=virtlet label. This label is used by the Virtlet DaemonSet to select nodes that must have the Virtlet runtime.

  • containers - a container configuration that includes:

    • name - name of the container.

    • image - specifies the path to a network location where the Docker image is stored. The path must start with the virtlet prefix followed by the URL to the required location.

    • resources - defines the resources, such as memory limitation for the libvirt domain.
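
To try the sample above, you can save it to a file and create the pod with kubectl. The commands below are a sketch: the file name is arbitrary, and access to the VM console through kubectl attach is an assumption based on the Virtlet documentation:

kubectl create -f cirros-vm.yaml
kubectl get pod cirros-vm -o wide
kubectl attach -it cirros-vm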

Example of a pod definition for an ephemeral device

Virtlet stores all ephemeral volumes in the local libvirt storage pool in /var/lib/virtlet/volumes. The volume configuration is defined in the volumes section.

The following text is an example of a pod (virtual machine) definition with an ephemeral volume of 2048 MB capacity.

apiVersion: v1
kind: Pod
metadata:
  name: test-vm-pod
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: test-vm
    image: download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
    volumeMounts:
    - name: containerd
      mountPath: /var/lib/containerd
  volumes:
  - name: containerd
    flexVolume:
      driver: "virtlet/flexvolume_driver"
      options:
        type: qcow2
        capacity: 2048MB

Example of a pod definition for a Ceph RBD

If your virtual machines store data in Ceph, you can attach Ceph RBDs to the virtual machines under Kubernetes control by specifying the required RBDs in the virtual machine pod definition. You do not need to mount the devices in the container.

Virtlet supports the following options for Ceph RBD devices:

Option

Parameter

FlexVolume driver

kubernetes.io/flexvolume_driver

Type

ceph

Monitor

ip:port

User

user-name

Secret

user-secret-key

Volume

rbd-image-name

Pool

pool-name

The following text is an example of a virtual machine pod definition with one Ceph RBD volume:

apiVersion: v1
kind: Pod
metadata:
  name: cirros-vm-rbd
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: cirros-vm-rbd
    image: virtlet/image-service.kube-system/cirros
    volumeMounts:
    - name: test
      mountPath: /testvol
  volumes:
  - name: test
    flexVolume:
      driver: kubernetes.io/flexvolume_driver
      options:
        Type: ceph
        Monitor: 10.192.0.1:6789
        User: libvirt
        Secret: AQDTwuVY8rA8HxAAthwOKaQPr0hRc7kCmR/9Qg==
        Volume: rbd-test-image
        Pool: libvirt-pool

Example of a pod definition for a block device

If you want to mount a block device that is available in the /dev/ directory on the Kubernetes node, you can specify the raw device in the pod definition.

The following text is an example of a pod definition with one block device. In this example, the path to the raw device is /dev/loop0, which means that a disk is associated with this path on the Virtlet node.

apiVersion: v1
kind: Pod
metadata:
  name: test-vm-pod
  annotations:
    kubernetes.io/target-runtime: virtlet
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: extraRuntime
            operator: In
            values:
            - virtlet
  containers:
  - name: test-vm
    image: download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
    volumeMounts:
    - name: raw
      mountPath: /rawvol
  volumes:
  - name: raw
    flexVolume:
      driver: "virtlet/flexvolume_driver"
      options:
        type: raw
        path: /dev/loop0

Reprovision the Kubernetes Master node

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

If the Kubernetes Master node became non-operational and recovery is not possible, you can reprovision the node from scratch.

When reprovisioning a node, you cannot update some of the configuration data:

  • Hostname and FQDN - because it breaks Calico.

  • Node role - for example, from the Kubernetes Master role to the Node role. However, you can use the kubectl label node command to reset the node labels later.

  • Network plugin - for example, from Calico to Weave.

You can change the following information:

  • Host IP(s)

  • MAC addresses

  • Operating system

  • Application certificates

Caution

All Master nodes must serve the same apiserver certificate. Otherwise, service tokens will become invalidated.

To reprovision the Kubernetes Master node:

  1. Verify that MAAS works properly and provides the DHCP service to assign an IP address and bootstrap an instance.

  2. Verify that the target nodes have connectivity with the Salt Master node:

    salt 'ctl[<NUM>]*' test.ping
    
  3. Update modules and states on the new Minion of the Salt Master node:

    salt 'ctl[<NUM>]*' saltutil.sync_all
    

    Note

    The ctl[<NUM>] parameter is the ID of a failed Kubernetes Master node.

  4. Create and distribute SSL certificates for services using the salt state:

    salt 'ctl[<NUM>]*' state.sls salt
    
  5. Install Keepalived:

    salt 'ctl[<NUM>]*' state.sls keepalived -b 1
    
  6. Install HAProxy and verify its status:

    salt 'ctl[<NUM>]*' state.sls haproxy
    salt 'ctl[<NUM>]*' service.status haproxy
    
  7. Install etcd and verify the cluster health:

    salt 'ctl[<NUM>]*' state.sls etcd.server.service
    salt 'ctl[<NUM>]*' cmd.run "etcdctl cluster-health"
    

    Install etcd with the SSL support:

    salt 'ctl[<NUM>]*' state.sls salt.minion.cert,etcd.server.service
    salt 'ctl[<NUM>]*' cmd.run '. /var/lib/etcd/configenv && etcdctl cluster-health'
    
  8. Install Kubernetes:

    salt 'ctl[<NUM>]*' state.sls kubernetes.master.kube-addons
    salt 'ctl[<NUM>]*' state.sls kubernetes.pool
    
  9. Set up NAT for Calico:

    salt 'ctl[<NUM>]*' state.sls etcd.server.setup
    
  10. Apply the kubernetes states, excluding kubernetes.master.setup, to check consistency:

    salt 'ctl[<NUM>]*' state.sls kubernetes exclude=kubernetes.master.setup
    
  11. Register add-ons:

    salt 'ctl[<NUM>]*' --subset 1 state.sls kubernetes.master.setup
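
    Optionally, verify the result from any Kubernetes Master node. The checks below are a sketch of a basic post-reprovisioning verification:

    kubectl get nodes
    kubectl get componentstatuses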
    

Kubernetes Nodes operations

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

This section contains the Kubernetes Nodes-related operations.

Add a Kubernetes Node automatically

MCP DriveTrain enables you to automatically scale up the number of Nodes in your MCP Kubernetes cluster if required.

Note

Currently, only scaling up is supported using the Jenkins pipeline job. However, you can scale down the number of Kubernetes Nodes manually as described in Remove a Kubernetes Node.

To scale up a Kubernetes cluster:

  1. Log in to the Jenkins web UI as an administrator.

    Note

    To obtain the password for the admin user, run the salt "cid*" pillar.data _param:jenkins_admin_password command from the Salt Master node.

  2. Find the deployment pipeline job that you used to successfully deploy your Kubernetes cluster. For deployment details, refer to the Deploy a Kubernetes cluster procedure. You will reuse the existing deployment pipeline job to scale up your existing Kubernetes cluster.

  3. Select the Build with Parameters option from the drop-down menu of the pipeline job.

  4. Reconfigure the following parameters:

    Kubernetes parameters to scale up the existing cluster

    Parameter

    Description

    STACK_COMPUTE_COUNT

    The number of Kubernetes Nodes to be deployed by the pipeline job. Configure as required for your use case.

    STACK_NAME

    The Heat stack name to reuse.

    STACK_REUSE

    Select to reuse the existing Kubernetes deployment that requires scaling up.

  5. Click Build to launch the pipeline job.

As a result of the deployment pipeline job execution, your existing Kubernetes cluster will be scaled up during the Scale Kubernetes Nodes stage as configured. The preceding stages of the workflow will be executed as well to ensure proper configuration. However, it will take a significantly shorter period of time to execute these stages, as most of the operations have been already performed during the initial cluster deployment.

Add a Kubernetes Node manually

This section describes how to manually add a Kubernetes Node to your MCP cluster to increase the cluster capacity, for example.

To add a Kubernetes Node manually:

  1. Add a physical node using MAAS as described in the MCP Deployment Guide: Provision physical nodes using MAAS.

  2. Log in to the Salt Master node.

  3. Verify that salt-minion is running on the target node and this node appears in the list of the Salt keys:

    salt-key
    

    Example of system response:

    cmp0.bud.mirantis.net
    cmp1.bud.mirantis.net
    cmp2.bud.mirantis.net
    
  4. Apply the Salt states to the target node. For example, to cmp2:

    salt 'cmp2*' saltutil.refresh_pillar
    salt 'cmp2*' saltutil.sync_all
    salt 'cmp2*' state.apply salt
    salt 'cmp2*' state.apply linux,ntp,openssh,git
    salt 'cmp2*' state.sls kubernetes.pool
    salt 'cmp2*' service.restart 'kubelet'
    salt 'cmp2*' state.apply salt
    salt '*' state.apply linux.network.host
    
  5. If Virtlet will run on the target node, add the node label:

    salt -C 'I@kubernetes:master and *01*' \
    cmd.run 'kubectl label --overwrite node cmp2 extraRuntime=virtlet'
    
  6. Log in to any Kubernetes Master node.

  7. Verify that the target node appears in the list of the cluster nodes and is in the Ready state:

    kubectl get nodes
    

    Example of system response:

    NAME STATUS ROLES AGE  VERSION
    cmp0 Ready  node  54m  v1.10.3-3+93532daa6d674c
    cmp1 Ready  node  54m  v1.10.3-3+93532daa6d674c
    cmp2 Ready  node  2m   v1.10.3-3+93532daa6d674c
    

Remove a Kubernetes Node

This section describes how to remove a Kubernetes Node from your MCP cluster.

To remove a Kubernetes Node:

  1. Log in to the Kubernetes Node that you want to remove.

  2. Stop and disable the salt-minion service on this node:

    systemctl stop salt-minion
    systemctl disable salt-minion
    
  3. Log in to the Salt Master node.

  4. Verify that the node name is not registered in salt-key. If the node is present, remove it:

    salt-key | grep <node_name><NUM>
    salt-key -d <node_name><NUM>.domain_name
    
  5. Log in to any Kubernetes Master node.

  6. Mark the Node as unschedulable to prevent new pods from being assigned to it:

    kubectl cordon <node_ID>
    kubectl drain <node_ID>
    
  7. Remove the Kubernetes Node:

    kubectl delete node cmp<node_ID>
    

    Wait until the workloads are gracefully deleted and the Kubernetes Node is removed.

  8. Verify that the node is absent in the Kubernetes Nodes list:

    kubectl get nodes
    
  9. Open your Git project repository with Reclass model on the cluster level.

  10. In infra/config.yml, remove the definition of the Kubernetes Node in question under the reclass:storage:node pillar.

  11. Log in to the Kubernetes Node in question.

  12. Run the following commands:

    systemctl stop kubelet
    systemctl disable kubelet
    

Reprovision a Kubernetes Node

You may need to reprovision a failed Kubernetes Node. When reprovisioning a Kubernetes Node, you cannot update some of the configuration data:

  • Hostname and FQDN - because it breaks Calico.

  • Node role - for example, from the Kubernetes Master role to the Node role. However, you can use the kubectl label node command to reset the node labels later.

You can change the following information:

  • Host IP(s)

  • MAC addresses

  • Operating system

  • Application certificates

Caution

All Master nodes must serve the same apiserver certificate. Otherwise, service tokens will become invalidated.

To reprovision a Kubernetes Node:

  1. In the MAAS web UI, make the required changes to the target Kubernetes Node.

  2. Verify that MAAS works properly and provides the DHCP service to assign IP addresses and bootstrap an instance.

  3. Proceed with the Add a Kubernetes Node manually procedure starting from step 2.

Use the role-based access control (RBAC)

Caution

Kubernetes support termination notice

Starting with the MCP 2019.2.5 update, the Kubernetes component is no longer supported as a part of the MCP product. This implies that Kubernetes is not tested and not shipped as an MCP component. Although the Kubernetes Salt formula is available in the community driven SaltStack formulas ecosystem, Mirantis takes no responsibility for its maintenance.

Customers looking for a Kubernetes distribution and Kubernetes lifecycle management tools are encouraged to evaluate the Mirantis Kubernetes-as-a-Service (KaaS) and Docker Enterprise products.

After you enable the role-based access control (RBAC) on your Kubernetes cluster as described in Deployment Guide: Enable RBAC, you can start controlling system access to authorized users by creating, changing, or restricting user or services roles as required. Use the kubernetes.control.role state to orchestrate the role and role binding.

The following example illustrates a configuration of a brand-new role and role binding for a service account:

control:
  role:
    etcd-operator:
      kind: ClusterRole
      rules:
        - apiGroups:
            - etcd.coreos.com
          resources:
            - clusters
          verbs:
            - "*"
        - apiGroups:
            - extensions
          resources:
            - thirdpartyresources
          verbs:
            - create
        - apiGroups:
            - storage.k8s.io
          resources:
            - storageclasses
          verbs:
            - create
        - apiGroups:
            - ""
          resources:
            - replicasets
          verbs:
            - "*"
      binding:
        etcd-operator:
          kind: ClusterRoleBinding
          namespace: test # <-- if no namespace, then it is ClusterRoleBinding
          subject:
            etcd-operator:
              kind: ServiceAccount

The following example illustrates a configuration of the test edit permissions for a User in the test namespace:

kubernetes:
  control:
    role:
      edit:
        kind: ClusterRole
        # No rules defined, so only binding will be created assuming role
        # already exists.
        binding:
          test:
            namespace: test
            subject:
              test:
                kind: User
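
To apply such a definition and verify the result, you can run the kubernetes.control.role state mentioned above and inspect the created objects with kubectl. The commands below are a sketch; the Salt target for the Kubernetes Master nodes is an assumption that may differ in your model:

salt -C 'I@kubernetes:master' state.sls kubernetes.control.role
kubectl get clusterrole etcd-operator
kubectl get clusterrolebinding etcd-operator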

OpenContrail operations

This section describes how to configure and use the OpenContrail-related features. To troubleshoot the OpenContrail services, refer to Troubleshoot OpenContrail.

Verify the OpenContrail status

To ensure that OpenContrail is up and running, verify the status of all the OpenContrail-related services. If any service fails, restart it as described in Restart the OpenContrail services.

For OpenContrail 4.x

  1. Log in to the Salt Master node.

  2. Apply one of the following states depending on the Build ID of your MCP cluster:

    • For MCP Build ID 2018.11.0 or later:

      salt -C 'ntw* or nal*' state.sls opencontrail.upgrade.verify
      

      If the state is applied successfully, it means that all OpenContrail services are up and running.

    • For MCP Build ID 2018.8.0 or 2018.8.0-milestone1:

      salt -C 'ntw* or nal*' cmd.run 'doctrail all contrail-status'
      

      In the output, all services must be either in active or backup (for example, for contrail-schema, contrail-svc-monitor, contrail-device-manager services) state.

For OpenContrail 3.2

Eventually, all services should be active except for contrail-device-manager, contrail-schema, and contrail-svc-monitor. These services are in the active state on only one OpenContrail controller ntw node in the cluster, switching dynamically between the nodes in case of a failure. On the two other OpenContrail controller nodes, these services are in the backup state.

To verify the OpenContrail services status, apply the following state for the OpenContrail ntw and nal nodes from the Salt Master node:

salt -C 'ntw* or nal*' cmd.run 'contrail-status'

Example of system response:

== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      active
contrail-dns                  active
contrail-named                active

== Contrail Analytics ==
supervisor-analytics:         active
contrail-analytics-api        active
contrail-analytics-nodemgr    active
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       active
contrail-topology             active

== Contrail Config ==
supervisor-config:            active
contrail-api:0                active
contrail-config-nodemgr       active
contrail-device-manager       initializing
contrail-discovery:0          active
contrail-schema               initializing
contrail-svc-monitor          initializing
ifmap                         active

== Contrail Web UI ==
supervisor-webui:             active
contrail-webui                active
contrail-webui-middleware     active

== Contrail Database ==
supervisor-database:          active
contrail-database             active
contrail-database-nodemgr     active

== Contrail Support Services ==
supervisor-support-service:   active
rabbitmq-server               active

Restart the OpenContrail services

You may need to restart an OpenContrail service, for example, during an MCP cluster update or upgrade when a service failure is caused by the asynchronous restart order of the OpenContrail services after the kvm nodes update or reboot.

All OpenContrail 4.x services run as the systemd services in a Docker container.

All OpenContrail 3.2 services are managed by the process supervisord. The supervisord daemon is automatically installed with the OpenContrail packages including the following OpenContrail Supervisor groups of services:

  • supervisor-database

  • supervisor-config

  • supervisor-analytics

  • supervisor-control

  • supervisor-webui

To restart the OpenContrail 4.x services:

  1. Log in to the Salt Master node.

  2. Restart the required service on the corresponding OpenContrail node using the following example:

    salt 'ntw03' cmd.run 'doctrail controller service contrail-api restart'
    

    Note

    For a list of OpenContrail containers names to be used by the doctrail utility, see: The doctrail utility for the OpenContrail containers in OpenStack.

  3. If restarting of a service in question does not change its failed status, proceed to further troubleshooting as described in Troubleshoot OpenContrail. For example, to troubleshoot Cassandra not starting, refer to Troubleshoot Cassandra for OpenContrail 4.x.

To restart the OpenContrail 3.2 services:

  1. Log in to the required OpenContrail node.

  2. Select from the following options:

    • To restart the OpenContrail services group as a whole Supervisor, use the service <supervisor_group_name> restart command. For example:

      service supervisor-control restart
      
    • To restart individual services inside the Supervisor group, use the service <supervisor_group_service_name> restart command. For example:

      service contrail-config-nodemgr restart
      

    To identify the services names inside a specific OpenContrail Supervisor group, use the supervisorctl -s unix:///tmp/supervisord_<group_name>.sock status command. For example:

    supervisorctl -s unix:///tmp/supervisord_database.sock status
    

    Example of system response:

    contrail-database                RUNNING    pid 1349, uptime 2 days, 21:12:33
    contrail-database-nodemgr        RUNNING    pid 1347, uptime 2 days, 21:12:33
    
    supervisorctl -s unix:///tmp/supervisord_config.sock status
    
    contrail-api:0                   RUNNING    pid 49848, uptime 2 days, 20:11:54
    contrail-config-nodemgr          RUNNING    pid 49845, uptime 2 days, 20:11:54
    contrail-device-manager          RUNNING    pid 49849, uptime 2 days, 20:11:54
    contrail-discovery:0             RUNNING    pid 49847, uptime 2 days, 20:11:54
    contrail-schema                  RUNNING    pid 49850, uptime 2 days, 20:11:54
    contrail-svc-monitor             RUNNING    pid 49851, uptime 2 days, 20:11:54
    ifmap                            RUNNING    pid 49846, uptime 2 days, 20:11:54
    
    supervisorctl -s unix:///tmp/supervisord_analytics.sock status
    
    contrail-analytics-api           RUNNING    pid 1346, uptime 2 days, 21:13:17
    contrail-analytics-nodemgr       RUNNING    pid 1340, uptime 2 days, 21:13:17
    contrail-collector               RUNNING    pid 1344, uptime 2 days, 21:13:17
    contrail-query-engine            RUNNING    pid 1345, uptime 2 days, 21:13:17
    contrail-snmp-collector          RUNNING    pid 1341, uptime 2 days, 21:13:17
    contrail-topology                RUNNING    pid 1343, uptime 2 days, 21:13:17
    
    supervisorctl -s unix:///tmp/supervisord_control.sock status
    
    contrail-control                 RUNNING    pid 1330, uptime 2 days, 21:13:29
    contrail-control-nodemgr         RUNNING    pid 1328, uptime 2 days, 21:13:29
    contrail-dns                     RUNNING    pid 1331, uptime 2 days, 21:13:29
    contrail-named                   RUNNING    pid 1333, uptime 2 days, 21:13:29
    
    supervisorctl -s unix:///tmp/supervisord_webui.sock status
    
    contrail-webui                   RUNNING    pid 1339, uptime 2 days, 21:13:44
    contrail-webui-middleware        RUNNING    pid 1342, uptime 2 days, 21:13:44
    

Access the OpenContrail web UI

Your OpenContrail cluster may not use SSL overall because of not having a certificate authority available. By default, OpenContrail uses SSL and requires certificate authentication. If you attempt to access the OpenContrail web UI through the proxy with such a configuration, the UI accepts your credentials but logs you out immediately. As a workaround, you can use HTTP directly to the OpenContrail web UI management VIP, bypassing the proxy.

To access the OpenContrail web UI:

  1. Obtain the Administrator credentials. Select from the following options depending on your cluster type:

    • For OpenStack:

      1. Log in to the Salt Master node.

      2. Apply the following state:

        salt 'ctl01*' cmd.run 'cat /root/keystonerc'
        
      3. From the output of the command above, record the values of OS_USERNAME and OS_PASSWORD.

    • For Kubernetes:

      1. Log in to any OpenContrail controller node.

      2. Run the following command:

        cat /etc/contrail/contrail-webui-userauth.js | grep "auth.admin"
        
      3. From the output of the command above, record the values of auth.admin_user and auth.admin_password.

  2. In a browser, type either the OpenStack controller node VIP or the Kubernetes controller node VIP on port 8143. For example, https://172.31.110.30:8143.

  3. On the page that opens, configure your web browser to trust the certificate if you have not done so yet:

    • In Google Chrome or Chromium, click Advanced > Proceed to <URL> (unsafe).

    • In Mozilla Firefox, navigate to Advanced > Add Exception, enter the URL in the Location field, and click Confirm Security Exception.

    Note

    For other web browsers, the steps may vary slightly.

  4. Enter the Administrator credentials obtained in the step 1. Leave the Domain field empty unless the default configuration was customized.

  5. Click Sign in.

Configure route targets for external access

Configuring the OpenContrail route targets for your Juniper MX routers allows extending the private network outside the MCP cloud.

To configure route targets for external access:

  1. Log in to the OpenContrail web UI as described in Access the OpenContrail web UI.

  2. Navigate to Configure > Networking > Networks.

  3. Click the gear icon of the network that you choose to be external and select Edit.

  4. In the Edit window:

    1. Expand Advanced Options.

    2. Select the Shared and External check boxes.

    3. Expand Route Target(s).

    4. Click the + symbol to add ASN and Target.

    5. Enter the corresponding numbers set during provisioning of the Juniper MX router that is used in your MCP cluster.

    6. Click Save.

  5. Verify the route targets configuration:

    1. Navigate to Configure > Infrastructure > BGP Routers.

    2. Expand one of the BGP Router or Control Node nodes menu.

    3. Verify that the Autonomous System value matches the ASN set in the previous steps.

Enable Long Lived Graceful Restart in OpenContrail

Warning

Enabling LLGR causes restart of the Border Gateway Protocol (BGP) peerings.

Enabling Long Lived Graceful Restart (LLGR) must be performed on both sides of the peering: the edge gateways and the OpenContrail control plane.

To enable LLGR:

  1. Log in to the MX Series router CLI.

  2. Add the following lines to the router configuration file:

    set protocols bgp group <name> family inet-vpn unicast graceful-restart long-lived restarter stale-time 20
    set protocols bgp group <name> graceful-restart restart-time 1800
    set protocols bgp group <name> graceful-restart stale-routes-time 1800
    
  3. Commit the configuration changes to the router.

  4. Open your Git project repository with the Reclass model on the cluster level.

  5. Add the following lines to cluster/<name>/opencontrail/control.yml:

    classes:
    ...
    - system.opencontrail.client.resource.llgr
    ...
    
  6. Commit and push the changes to the project Git repository.

  7. Log in to the Salt Master node.

  8. Pull the latest changes of the cluster model and the system model that has the system.opencontrail.client.resource.llgr class defined.

  9. Update the salt-formula-opencontrail package.

  10. Apply the opencontrail state:

    salt -C 'I@opencontrail:config and *01*' state.sls opencontrail.client
    

Use the OpenContrail API client

The contrail-api-cli command-line utility interacts with the OpenContrail API server, allowing you to search for or modify API resources, and supports Unix-style commands. For more information, see the official contrail-api-cli documentation.

This section contains the following topics:

Install the OpenContrail API client

To install contrail-api-cli:

  1. Log in to any OpenContrail controller node. For example, ntw01.

  2. Install the Python virtual environment for contrail-api-cli:

    apt-get install python-pip python-dev -y &&\
    pip install virtualenv && \
    virtualenv contrail-api-cli-venv && \
    source contrail-api-cli-venv/bin/activate && \
    git clone https://github.com/eonpatapon/contrail-api-cli/ && \
    cd contrail-api-cli;sudo python setup.py install
    

Access the OpenContrail API client

To access the OpenContrail API:

  1. Use the keystonerc file with credentials and endpoints:

    source /root/keystonerc
    source /root/keystonercv3
    
  2. Connect to the OpenContrail API using the following command:

    contrail-api-cli --host 10.167.4.20 --port 9100 shell
    

    Or you can use your OpenStack credentials. For example:

    contrail-api-cli --os-user-name admin --os-password workshop  \
    --os-auth-plugin v2password --host 10.10.10.254 --port 8082 --protocol http \
    --insecure --os-auth-url http://10.10.10.254:5000/v2.0 --os-tenant-name admin shell
    

Note

  • MCP uses the 9100 port by default, whereas the OpenContrail API standard port is 8082.

  • For the ln command, define a schema version using the --schema-version parameter, for example, --schema-version 3.1. The known versions are: 1.10, 2.21, 3.0, 3.1, and 3.2.
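
  For example, a hypothetical invocation that combines the default MCP port with an explicit schema version:

  contrail-api-cli --host 10.167.4.20 --port 9100 --schema-version 3.1 shell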

The contrail-api-cli-extra package

The contrail-api-cli-extra package contains the contrail-api-cli commands to make the OpenContrail installation and operation process easier.

The commands are grouped in different sub-packages and have different purposes:

  • clean: detect and remove bad resources

  • fix: detect and fix bad resources

  • migration: handle data migration when upgrading OpenContrail to a major version

  • misc: general-purpose commands

  • provision: provision and configure an OpenContrail installation

To install the contrail-api-cli-extra package:

Run the following command:

pip install contrail-api-cli-extra

The most used contrail-api-cli-extra sub-packages are the following:

  • contrail_api_cli_extra.clean - since this sub-package allows removing resources, you must explicitly load the contrail_api_cli.clean namespace to run the commands of this sub-package.

    Example of usage:

    # General form
    contrail-api-cli --ns contrail_api_cli.clean <command>
    # For example, open a shell with the clean namespace loaded
    contrail-api-cli --host 10.167.4.21 --port 9100 --ns contrail_api_cli.clean shell
    

    This package includes the clean-<type> command. Replace type with the required type of cleaning process. For example:

    • clean-route-target

    • clean-orphaned-acl

    • clean-si-scheduling

    • clean-stale-si
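
    For example, a hypothetical invocation of one of these cleaning commands through the clean namespace, based on the usage above:

    contrail-api-cli --host 10.167.4.21 --port 9100 --ns contrail_api_cli.clean clean-orphaned-acl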

  • contrail_api_cli_extra.fix - allows you to verify and fix misconfigured resources. For example, fix multiple security groups or association of a subnet with a virtual network (VN) in a key-value store.

    If this sub-package is installed, it launches with contrail-api-cli automatically.

    Example of usage:

    fix-vn-id virtual-network/600ad108-fdce-4056-af27-f07f9faa5cae --zk-server 10.167.4.21
    
    fix-zk-ip --dry-run --zk-server 10.167.4.21:2181 virtual-network/xxxxxx-xxxxxxx-xxxxxx
    

Define aging time for flow records

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To prevent high memory consumption by vRouter on highly loaded clusters, you can define the required aging time for flow records. Flows are aged out after they have been inactive for a specific period of time. By default, the timeout value is 180 seconds. You can configure the timeout depending on your cluster load needs using the flow_cache_timeout parameter of the contrail-vrouter-agent service.

To configure flow_cache_timeout:

  1. Log in to the Salt Master node.

  2. In classes/cluster/<cluster_name>/opencontrail/compute.yml of your Reclass model, define the required value in seconds for the flow_cache_timeout parameter:

    parameters:
      opencontrail:
        ...
        compute:
          ...
          flow_cache_timeout: 180
          ...
    
  3. Apply the changes:

    salt -C 'I@opencontrail:compute' state.apply opencontrail.compute
    

OpenContrail 4.x-specific operations

This section contains the OpenContrail operations applicable to version 4.x.

Modify the OpenContrail configuration

The OpenContrail v4.x configuration files are generated by SaltStack and then mounted into containers as volumes. SaltStack also provides a mechanism to apply configuration changes by restarting the systemd services inside a container.

To modify the OpenContrail v4.x configurations:

  1. Log in to the Salt Master node.

  2. Make necessary configuration changes in the classes/cluster/<cluster_name>/opencontrail folder.

  3. Apply the opencontrail state specific to the OpenContrail configuration file where the changes were made. For example, if you made changes in the database.yml file:

    salt -C 'I@opencontrail:database' state.apply opencontrail
    

    After the state is applied, the systemd services are automatically restarted inside the OpenContrail container in question to apply the configuration changes.

The doctrail utility for the OpenContrail containers in OpenStack

In an OpenStack-based MCP environment, the OpenContrail installation includes the doctrail utility that provides an easy access to the OpenContrail containers. Its default installation folder is /usr/bin/doctrail.

The doctrail usage is as follows:

doctrail {analytics|analyticsdb|controller|all} {<command_to_send>|console}

The acceptable destinations used by doctrail are as follows:

  • controller for the OpenContrail controller container (console, commands)

  • analytics for the OpenContrail analytics container (console, commands)

  • analyticsdb for the OpenContrail database of the analytics container (console, commands)

  • all for all containers on the host (commands only)

The doctrail commands examples:

# Show contrail-status on all containers on this host
doctrail all contrail-status

# Show contrail-status on controller container
doctrail controller contrail-status

# Restart contrail-database on controller container
doctrail controller service contrail-database restart

# Connect to the console of controller container
doctrail controller console

# Connect to the console of analytics container
doctrail analytics console

Set multiple contrail-api workers

In the MCP Build ID 2019.2.0, by default, one worker of the contrail-api service is used. Starting from the MCP 2019.2.3 maintenance update, six workers are used by default.

If needed, you can change the default configuration using the instruction below. This section also describes how to stop, start, or restart multiple workers.

To set multiple contrail-api workers for the MCP version 2019.2.3 or later:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster name>/opencontrail/control.yml, set the required amount of workers:

    parameters:
      _param:
        opencontrail_api_workers_count: <required_amount_of_workers>
    
  3. Log in to the Salt Master node.

  4. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    salt-call state.sls reclass.storage
    
  5. Apply the Reclass model changes:

    salt -C 'I@opencontrail:control' state.apply opencontrail
    

To set multiple contrail-api workers for the Build ID 2019.2.0:

Caution

With the configuration below, you can start creating network entities in a newly created OpenStack project only one minute after the project is created.

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In cluster/<cluster name>/opencontrail/control.yml, set the required amount of workers:

    parameters:
      _param:
        opencontrail_api_workers_count: <required_amount_of_workers>
    
  3. Log in to the Salt Master node.

  4. Refresh pillars:

    salt '*' saltutil.refresh_pillar
    salt-call state.sls reclass.storage
    
  5. Apply the Reclass model changes:

    salt -C 'I@opencontrail:control' state.apply opencontrail
    
  6. Log in to any ntw node.

  7. In /etc/contrail/contrail-api.conf, change the following parameter:

    [KEYSTONE]
    keystone_sync_on_demand=true
    

    to

    [KEYSTONE]
    keystone_sync_on_demand=false
    
  8. Restart the OpenContrail controller Docker container:

    cd /etc/docker/compose/opencontrail/; docker-compose down; docker-compose up -d
    
  9. Wait until all OpenContrail controller services are up and running. To verify the OpenContrail services status, refer to Verify the OpenContrail status.

  10. Repeat the steps 7-9 on the remaining ntw nodes.

To stop, start, or restart multiple workers:

Caution

We recommend that you do not stop, start, or restart the contrail-api workers by executing the service command as it may cause unstable worker behavior such as an incorrect number of running workers and race conditions.

  • To stop all contrail-api workers on the target node:

    systemctl stop contrail-api@*
    
  • To start all contrail-api workers on the target node:

    systemctl start contrail-api@*
    
  • To restart all contrail-api workers on the target node:

    systemctl restart contrail-api@*
    

Note

You may stop, start, or restart a certain worker by using the worker ID instead of the * character.
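
For example, assuming the worker instances are numbered starting from 0:

systemctl restart contrail-api@0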

Enable SSL for an OpenContrail API internal endpoint

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

This section describes how to manually enable SSL for an internal endpoint of the OpenContrail API.

To enable SSL for an OpenContrail API internal endpoint:

  1. Open your Git project repository with Reclass model on the cluster level.

  2. In cluster/<cluster_name>/opencontrail/init.yml, add the following parameters:

    parameters:
      _param:
    ...
        opencontrail_api_ssl_enabled: True
        opencontrail_api_protocol: https
    ...
    
  3. In cluster/<cluster_name>/opencontrail/analytics.yml, cluster/<cluster_name>/opencontrail/control.yml, and cluster/<cluster_name>/opencontrail/compute.yml, add the following class:

    classes:
    ...
      - system.salt.minion.cert.opencontrail.api
    ...
    
  4. Log in to the Salt Master node.

  5. Create certificates on the OpenContrail nodes:

    salt -C "I@opencontrail:database or I@opencontrail:compute" state.sls salt.minion.cert
    
  6. Configure the HAProxy services:

    salt -C "I@opencontrail:control" state.apply haproxy
    
  7. Configure the OpenContrail services:

    salt -C "I@opencontrail:database" state.apply opencontrail exclude=opencontrail.compute
    salt -C "I@opencontrail:compute" state.apply opencontrail.client
    
  8. Configure the nodes that run neutron-server:

    salt -C "I@neutron:server" state.apply neutron.server
    
  9. Restart the neutron-server service to use SSL to connect to contrail-api VIP:

    salt -C "I@neutron:server" service.restart neutron-server
    

DevOps Portal

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

MCP’s Operations Support System (OSS), known as StackLight, now includes DevOps Portal connected to OSS services. DevOps Portal significantly reduces the complexity of Day 2 cloud operations through services and dashboards with a high degree of automation, availability statistics, resource utilization, capacity utilization, continuous testing, logs, metrics, and notifications. DevOps Portal enables cloud operators to manage larger clouds with greater uptime without requiring large teams of experienced engineers and developers.

This solution builds on MCP operations-centric vision of delivering a cloud environment through a CI/CD pipeline with continuous monitoring and visibility into the platform.

The portal collects a comprehensive set of data about the cloud, offers visualization dashboards, and enables the cloud operator to interact with a variety of tools. More specifically, the DevOps Portal includes the following dashboards:

Push Notification

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Push Notification service enables users to send notifications, execute API calls, and open tickets on helpdesk. The service can also connect several systems through a notification queue that transforms and routes API calls from source systems into specific protocols that other systems use to communicate. With the notification system, you can send an API call and transform it into another API call, an email, an AMQP message, or another protocol.

Note

The Push Notification service depends on the following services:

  • DevOps Portal web UI

  • Elasticsearch cluster (version 5.x.x or higher)

  • PostgreSQL database

To view and search for generated notifications, navigate to the Notifications dashboard available in the DevOps Portal UI.

Cloud Health

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Health dashboard is the UI for the Cloud Health Service.

The Cloud Health service collects availability results for all cloud services and failed customer (tenant) interactions (FCI) for a subset of those services. These metrics are displayed so that operators can see both point-in-time health status and trends over time.

Note

The Cloud Health service depends on the following services:

  • DevOps Portal web UI

  • Grafana service of StackLight LMA

To view the metrics:

  1. Log in to the DevOps Portal.

  2. Navigate to the Cloud health dashboard.

  3. View the metrics on tabs depending on your needs:

    • The Availability tab for the availability results for all cloud services

    • The FCI tab for FCI for a subset of cloud services

Cloud Intelligence

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Intelligence service collects and stores data from MCP services, including OpenStack, Kubernetes, bare metal, and others. The data can be queried to enable use cases such as cost visibility, business insights, cost comparison, chargeback/showback, cloud efficiency optimization, and IT benchmarking. Operators can interact with the resource data using a wide range of queries, for example, searching for the last VM rebooted, total memory consumed by the cloud, number of containers that are operational, and so on.

Note

The Cloud Intelligence service depends on the following services:

  • DevOps Portal web UI

  • Elasticsearch cluster (version 5.x.x or higher)

  • Runbook Automation

To start creating queries that will be submitted to the search engine and display the list of resources:

  1. Log in to the DevOps Portal.

  2. Navigate to the Cloud Intelligence dashboard.

  3. Create your query using the cloud intelligence search query syntax:

    • To search by groups, use:

      • type=vm for instances

      • type=image for images

      • type=flavor for flavors

      • type=host for hosts

      • type=availabilityZone for availability zones

      • type=network for networks

      • type=volume for volumes

      • type=stack for Heat stacks

      • type=tenant for tenants

      • type=user for users

    • To search by field names, specify the field name with the value it contains. For example:

      Note

      The search by query string is case-insensitive.

      • status=active displays all resources with Active in the Status field, meaning they are in active status

      • status=(active saving) displays all resources in Active or Saving statuses

      • name="test_name" displays all resources which Name fields contain the exact phrase test_name

    • To group a number of queries in a single one, use the following boolean search operators:

      • | : the pipe symbol stands for the OR operation. Usage example: minDisk=0 | minRam=0

      • + : the plus symbol stands for the AND operation. Usage example: minDisk=0 + minRam=0

      • - : the minus symbol negates a single token. Usage example: minRam=0 + -minDisk=40 searches for resources with minRam equal to 0 and minDisk not equal to 40 at the same time

      • ( ) : parentheses signify grouping and precedence. Usage example: (minDisk=0 minRam=0) + minDisk=40

    • To search for the reserved characters, escape them with \. The whole list of these characters includes + - = & | > < ! ( ) { } [ ] ^ " ~ * ? : \ /.
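
    For example, a hypothetical query that combines the syntax above to find active instances named test_name:

      type=vm + status=active + name="test_name"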

  4. View search results:

    • An item name and most important properties are visible by default.

    • To view the full item properties list, click on the item block.

    Note

    If you have the Cleanup service enabled, the Create Janitor rule button is available for the groups that Janitor supports like VMs, Images, and Tenants. The button provides the same functionality as the Create new rule button on the Janitor dashboard with the conditions list prefilled with item-specific properties.

  5. Export search results into JSON, YAML, and CSV formats using the corresponding buttons on the Cloud Intelligence dashboard. The exported data contains the original query and the resulting groups with their items.

Cloud Vector

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Vector dashboard uses a node graph to represent a cloud environment in a form of a cloud map. The entities that build the map include availability zones (AZs), hosts, VMs, and services. Each host represents a compute node in a particular AZ with all VMs and services running on it. Thereby, a cloud map enables you to easily identify the number of nodes running in your cloud environment.

The screen capture below is an example of a cloud map created by Cloud Vector.

_images/devops-portal-cloud-vector.png

Note

The Cloud Vector dashboard depends on the following services:

  • DevOps Portal web UI

  • Cloud Intelligence service

To use the Cloud Vector dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the Cloud Vector dashboard.

  3. Proceed with the following available actions as required:

    • Collapse child elements:

      Note

      Hosts with more than 50 child VMs are collapsed by default.

      Note

      The size of a host circle depends on the number of its child elements. The more VMs a host owns, the bigger it is.

      • Double-click on an AZ or a host to collapse its child elements. If a host is collapsed, the number of its VMs is displayed. Services are not collapsed when you collapse a host.

      • Use the slider to collapse the nodes whose VM count matches the specified conditions.

    • Expand child elements:

      • Double-click on a collapsed element to expand its child elements.

      • Click Expand all to expand all collapsed elements.

    • Drag elements on the canvas:

      • Drag a particular element to move it and all connected elements.

      • Drag the canvas background to change the position of all elements.

      • Click Reset zooming to reset canvas shifts.

    • Scale elements on the canvas:

      Note

      Red borders appear if elements are extended beyond the canvas boundaries.

      • Click on the canvas and scroll up or down to zoom in or out.

      • Click Reset zooming to reset scaling.

    • Show and hide node labels:

      • Use toggles to show or hide labels of particular entities.

      • Hover over a particular element to view its label.

Runbooks

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Runbooks Automation service enables operators to create a workflow of jobs that get executed at specific time intervals or in response to specific events. For example, operators can automate periodic backups, weekly report creations, specific actions in response to failed Cinder volumes, and so on.

Note

The Runbooks Automation service is not a lifecycle management tool and is not appropriate for reconfiguring, scaling, or updating MCP itself, as these operations are exclusively performed with DriveTrain.

Note

The Runbooks Automation service depends on the following services:

  • DevOps Portal web UI

  • PostgreSQL database

Using the Runbooks dashboard, operators can call preconfigured jobs or jobs workflows and track the execution status through the web UI.

Before you proceed with the dashboard, you may need to configure your own jobs or reconfigure the already existing ones to adjust them to special needs of your installation.

Configure Rundeck jobs

Rundeck enables you to easily add jobs to the Runbook Automation service as Rundeck jobs and chain them into workflows. Once the jobs are created and added to your Reclass model, you execute them through the DevOps Portal UI.

Create users

To configure Users in the Rundeck service:

  1. Use the following structure to configure users through pillar parameters:

    parameters:
      rundeck:
        server:
          users:
            user1:
              name: USER_NAME_1
              password: USER_PWD_1
              roles:
                - user
                - admin
                - architect
                - deploy
                - build
            user2:
              name: USER_NAME_2
              password:  USER_PWD_2
              roles:
                - user
                - deploy
                - build
            ...
    

    Note

    Currently, the default access control list (ACL) properly supports only admin users. Therefore, the user2 user in the configuration structure above will not be able to run jobs or view projects.

  2. Create API tokens for non-interactive communications with the Rundeck API. To configure tokens, specify the required parameters in the metadata. For example:

    parameters:
      rundeck:
        server:
          tokens:
            admin: token0
            User2: token4
    
  3. Apply the rundeck.server state:

    salt-call state.sls rundeck.server
    
  4. Restart the Rundeck service:

    docker service update --force rundeck_rundeck
    
Create projects

To create Projects in the Rundeck service:

  1. Use the following structure to configure projects through pillar parameters:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              description: PROJECT_DESCRIPTION
              node:
                node01:
                  nodename: NODE_NAME
                  hostname: HOST_NAME
                  username: USER_NAME
                  tags: TAGS
    

    For example:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              description: "Test project"
              node:
                node01:
                  nodename: node-1
                  hostname: 10.20.0.1
                  username: runbook
                  tags: [cicd, docker]
                node02:
                  nodename: node-2
                  hostname: 10.20.0.2
                  username: runbook
                  tags: [cicd, docker]
                node03:
                  nodename: node-3
                  hostname: 10.20.0.3
                  username: runbook
                  tags: [cicd, docker]
    

    All configured nodes in a particular project are available to run jobs and commands within this project. Also, nodes can be tagged, which allows for filtering when executing commands and jobs on nodes.

  2. The Rundeck metadata has a preconfigured user, called runbook, to access other nodes. You need to configure this user on the nodes before you can use jobs or commands in projects. To configure the runbook user, include the following class in the classes of your nodes and specify the rundeck_runbook_public_key and rundeck_runbook_private_key parameters:

    classes:
      - system.rundeck.client.runbook
    
  3. Apply the linux and openssh states:

    salt '*' state.sls linux.system.user
    salt '*' state.sls openssh.server
    
Configure jobs importing

You can configure a Git repository for each project and store Rundeck jobs in the repository. The following extract is an example of a Rundeck job that you can define in the Git repository for importing:

- description: Shows uptime of the system.
  executionEnabled: true
  group: systools
  loglevel: INFO
  name: uptime
  nodeFilterEditable: false
  nodefilters:
    dispatch:
      excludePrecedence: true
      keepgoing: true
      rankOrder: ascending
      threadcount: 1
    filter: tags:cicd
  nodesSelectedByDefault: true
  options:
  - enforced: true
    name: pretty
    values:
    - -p
  scheduleEnabled: true
  sequence:
    commands:
    - exec: uptime ${option.pretty}
    keepgoing: false
    pluginConfig:
      WorkflowStrategy:
        node-first: null
    strategy: node-first

This approach has the following limitations:

  • Changes introduced using the git commit --amend command are not supported.

  • The name parameter in job definition files is required. The value of the name parameter may differ from the file name and determines the resulting name of a job.

  • You can configure no more than one remote repository per project.

  • An incorrect or non-existent branch definition may not result in a Salt configuration error but can lead to an empty job list.

  • The Salt state may not recover jobs if you have specified the branch incorrectly. For example, if the jobs are lost due to the incorrect branch definition, the synchronization of jobs may be lost even if the correct branch is defined later and the Salt state is restarted.

To configure job importing for a project:

  1. To use a remote repository as a source of jobs, extend the project’s metadata as required. A minimal configuration includes the address parameter for the import plugin:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              plugin:
                import:
                  address: https://github.com/path/to/repo.git
    

    A complete list of all available parameters for the import plugin includes:

    Import plugin parameters:

    • address (default: https://github.com/path/to/repo.git) - String. A valid Git URL (Required)

    • branch (default: master) - String. The name of a repository branch (Optional)

    • import_uuid_behavior (default: remove) - String, one of preserve, remove, or archive. The UUID importing mode in job descriptions (Optional)

    • format (default: yaml) - String, either yaml or xml. The extension of files containing job definitions (Optional)

    • path_template (default: ${job.group}${job.name}.${config.format}) - String. The pattern to recognize job definition files (Optional)

    • file_pattern (default: '.*\.yaml') - Regex. The regex that filters jobs for importing (Optional)

    Example of the import plugin configuration with all the available parameters:

    parameters:
      rundeck:
        client:
          project:
            test_project:
              plugin:
                import:
                  address: https://github.com/akscram/rundeck-jobs.git
                  branch: master
                  import_uuid_behavior: remove
                  format: yaml
                  path_template: ${job.group}${job.name}.${config.format}
                  file_pattern: '.*\.yaml'
    
  2. Apply the Rundeck client state:

    salt-call state.sls rundeck.client
    
Configure iFrame forwarding

By default, the Rundeck service configuration does not enable access through an external proxy address and the exposed Rundeck port, which is 14440 by default. However, you can forward the Runbooks dashboard through a proxy endpoint if you access the DevOps Portal through external proxy networks.

To configure iFrame forwarding:

  1. Configure iFrame forwarding on the cluster level by specifying the following parameters in the oss/client.yml:

    rundeck_forward_iframe: True
    rundeck_iframe_host: <external-proxy-endpoint>
    rundeck_iframe_port: <external-proxy-port>
    rundeck_iframe_ssl: False
    
  2. Apply the updated rundeck.server formula:

    salt -C 'I@rundeck:server' state.sls rundeck.server
    
  3. Verify that there are no cached modules, grains, and so on, and that the minion configuration is updated:

    salt '*' saltutil.clear_cache
    salt -C 'I@docker:swarm:role:master' state.sls salt
    
  4. Refresh and update your deployment:

    salt '*' saltutil.refresh_beacons
    salt '*' saltutil.refresh_grains
    salt '*' saltutil.refresh_modules
    salt '*' saltutil.refresh_pillar
    salt '*' saltutil.sync_all
    
  5. Recreate the Rundeck stack:

    docker stack rm rundeck
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    salt -C 'I@rundeck:client' state.sls rundeck.client
    
  6. Specify a custom endpoint for the DevOps portal on the cluster level of the Reclass model in the oss/client.yml file:

    devops_portal:
      config:
        service:
          rundeck:
            endpoint:
              address: ${_param:rundeck_iframe_host}
              port: ${_param:rundeck_iframe_port}
              https: ${_param:rundeck_iframe_ssl}
    
  7. Recreate the DevOps portal stack:

    docker stack rm devops-portal
    salt -C 'I@devops_portal:config' state.sls devops_portal.config
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    

Now, you can add an additional configuration for proxying the defined address and apply it on the proxy nodes.
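
The following is a minimal, hypothetical sketch of such a proxy configuration, assuming the nginx proxy site structure of the salt-formula-nginx formula; the site name, backend address, and ports are illustrative and must be adjusted to your deployment:

parameters:
  nginx:
    server:
      site:
        nginx_proxy_rundeck:
          enabled: true
          type: nginx_proxy
          name: rundeck
          proxy:
            host: <rundeck-backend-address>
            port: <rundeck-exposed-port>
            protocol: http
          host:
            name: ${_param:rundeck_iframe_host}
            port: ${_param:rundeck_iframe_port}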

Configure an external datasource

You can enable the Runbooks automation service to use an external datasource through the Salt metadata. This section explains how to configure the service to use the PostgreSQL database as an external source for the datastore.

To enable the PostgreSQL database support:

  1. Define the following parameters on the cluster level of your Reclass model in the oss/client.yml file:

    parameters:
      _param:
        rundeck_postgresql_username: rundeck
        rundeck_postgresql_password: password
        rundeck_postgresql_database: rundeck
        rundeck_postgresql_host: ${_param:control_vip_address}
        rundeck_postgresql_port: 5432
      rundeck:
        server:
          datasource:
            engine: postgresql
            host: ${_param:rundeck_postgresql_host}
            port: ${_param:rundeck_postgresql_port}
            username: ${_param:rundeck_postgresql_username}
            password: ${_param:rundeck_postgresql_password}
            database: ${_param:rundeck_postgresql_database}
    
  2. Recreate Rundeck and PostgreSQL stacks:

    docker stack rm postgresql rundeck
    salt -C 'I@rundeck:server' state.sls rundeck.server
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    salt -C 'I@postgresql:client' state.sls postgresql.client
    salt -C 'I@rundeck:client' state.sls rundeck.client
    
  3. Verify that the Rundeck tables exist in PostgreSQL. Log in to PostgreSQL as the rundeck user from the monitoring node where the OSS services are running and check that the Rundeck tables, such as base_report, are present. For example:

    psql -h <postgresql_ip> -U rundeck -W rundeck
    rundeck=> \d
                  List of relations
     Schema |            Name            |   Type   |  Owner
    --------+----------------------------+----------+---------
     public | auth_token                 | table    | rundeck
     public | base_report                | table    | rundeck
     public | execution                  | table    | rundeck
     public | hibernate_sequence         | sequence | rundeck
     public | log_file_storage_request   | table    | rundeck
     public | node_filter                | table    | rundeck
     public | notification               | table    | rundeck
     public | orchestrator               | table    | rundeck
     public | plugin_meta                | table    | rundeck
     public | project                    | table    | rundeck
     public | rdoption                   | table    | rundeck
     public | rdoption_values            | table    | rundeck
     public | rduser                     | table    | rundeck
     public | report_filter              | table    | rundeck
     public | scheduled_execution        | table    | rundeck
     public | scheduled_execution_filter | table    | rundeck
     public | storage                    | table    | rundeck
     public | workflow                   | table    | rundeck
     public | workflow_step              | table    | rundeck
     public | workflow_workflow_step     | table    | rundeck
    (20 rows)
    
    rundeck=> select * from base_report;
    

Run preconfigured jobs from web UI

The Rundeck jobs and workflows are run automatically depending on their configuration. However, the DevOps Portal also enables you to run the preconfigured Rundeck jobs and workflows using the web UI and track the progress of their execution.

To run a Rundeck job:

  1. Log in to the DevOps Portal.

  2. Navigate to the Runbooks dashboard.

  3. Select the project you are interested in.

  4. Navigate to the Jobs tab in the top navigation bar. The jobs page will display all jobs you are authorized to view.

  5. If the jobs were defined inside groups, they will appear as a listing grouped into a folder. To reveal the folder content, click the folder icon.

  6. Navigate to the required job and click it. The job details page opens. This page contains the configuration parameters for this job as well as statistics, activity, and definition details.

  7. To run the job, click Run job now.

  8. Once you have started the job execution, follow the job’s output in the Execution follow page.

See also

DriveTrain

Security

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Security dashboard is the UI for the Security Audit Service, also known as Security Monkey. The service runs tests to track and evaluate security-related tenant changes and configurations. Using the Security dashboard, you can search through these audit findings and review them.

Note

The Security Audit service depends on the following services:

  • DevOps Portal web UI

  • Push Notification service

  • PostgreSQL database

To review security item details:

  1. Log in to the DevOps Portal.

  2. To quickly access current security items with unjustified issues, click the Security issues widget on the Dashboard tab. The Security items page opens.

  3. Click the name of the required security item in the Items section to view its details. The item details page opens.

  4. To review the configuration change that caused the security item to be raised, use the Revisions section. The affected parts of the configuration are color-coded:

    • Green stands for additions

    • Red stands for deletions

  5. To justify the unjustified issues, use the Issues section:

    1. Check the unjustified issue or issues.

    2. Edit the Justification field specifying the reason for the justification. For example, This has been approved by the Security team.

    3. Click Justify.

  6. To attach a comment containing any required content to an item:

    1. In the Item comments section, paste the comment to the comment field.

    2. Click Add comment.

  7. To search for specific issues, use the Issues section. Each issue has a link to a page of the corresponding item containing the details of the issue.

Janitor

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Janitor dashboard is the UI for the Cleanup service, also known as Janitor Monkey.

The Cleanup service is a tool that enables you to automatically detect and clean up unused resources in your MCP deployment that may include:

  • OpenStack resources: virtual machines

  • AWS resources: instances, EBS volumes, EBS snapshots, auto scaling groups, launch configurations, S3 bucket, security groups, and Amazon Machine Images (AMI)

The architecture of the service allows for easy configuration of the scanning schedule as well as a number of other operations. This section explains how to configure the Cleanup service to fit your business requirements as well as how to use the Janitor dashboard in the DevOps Portal.

Note

The Cleanup service depends on the following services:

  • DevOps Portal web UI

  • Push Notification service

  • Cloud Intelligence service

Overview of the resource termination workflow

The resource termination workflow includes the following stages:

  1. Determining and marking the clean-up candidate.

    The Cleanup service applies a set of pre-configured rules to the resources available in your cluster on a regular basis. If any of the rules holds, the resource becomes a clean-up candidate and is marked accordingly.

  2. Deleting the resource.

    The Cleanup service deletes the resource if the resource is marked as a clean-up candidate and the scheduled termination time passes. The resource owner can manually delete the resource to release it earlier.

Configure the scanning schedule

By default, the Janitor service scans your MCP cluster for unused resources once every hour from 11:00 a.m. to 11:59 p.m. in the Pacific Time Zone (PT) on weekdays. However, you can override the default schedule by defining the simianarmy.properties environment variables depending on the needs of your environment.

The Janitor service schedule parameters:

  • simianarmy.scheduler.frequency (default: 1) - works together with the frequencyUnit parameter to determine the scanning cycle. The 1 frequency together with the HOURS frequencyUnit means that the scanning is performed once every hour.

  • simianarmy.scheduler.frequencyUnit (default: HOURS) - the available values include the java.util.concurrent.TimeUnit enum values.

  • simianarmy.calendar.openHour (default: 11) - sets the time when the service starts performing any action (scheduling, deleting) on weekdays.

  • simianarmy.calendar.closeHour (default: 11) - sets the time when the service stops performing any action (scheduling, deleting) on weekdays.

  • simianarmy.calendar.timezone (default: America/Los_Angeles) - the time zone in which the Janitor service operates.

To configure the scanning schedule:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/${_param:cluster_name}/oss/client.yml file of the Reclass model, define the Cleanup service schedule parameters as required. For example:

    docker:
      client:
        stack:
          janitor_monkey:
            environment:
              simianarmy.scheduler.frequency: 3
              simianarmy.scheduler.frequencyUnit: MINUTES
    
  3. To apply the changes, recreate the stack:

    salt -C 'I@docker:swarm:role:master' cmd.run 'docker stack rm janitor_monkey'
    salt '*' saltutil.refresh_pillar
    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    

Clean up resources using web UI

The unused resources are cleaned up automatically according to the termination schedule. If you need to release an unused item earlier, you can terminate it manually using the DevOps Portal web UI.

To clean up resources manually:

  1. Log in to the DevOps Portal web UI.

  2. Navigate to the Janitor > Items tab.

  3. Check the items you need to clean up immediately from the list of resources scheduled for termination.

  4. Click Terminate.

Hardware Correlation

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Hardware (HW) correlation dashboard provides the capability to generate an on-demand dashboard that graphically illustrates the consumption of the physical host resources by VMs running on this host. For example, you can search for a VM and get a dashboard with the CPU, memory, and disk consumption for the compute node where this specific VM is running.

Note

The HW correlation dashboard depends on the following services:

  • DevOps Portal web UI

  • Cloud Intelligence service

  • Prometheus service of StackLight LMA

To use the HW correlation dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the HW correlation dashboard.

  3. To generate a dashboard for a compute host(s) on which a specific VM(s) is running:

    1. If required, filter the VMs list by tenants where they are running using the VM tenants filter.

    2. Select a VM(s) from the VMs drop-down list.

      Note

      By default, if the VM tenants filter is not used, all available VMs are present in the VMs drop-down list.

    3. If required, select resources from the Resources drop-down list.

    4. Click Search.

  4. Read the generated dashboard:

    • View the name of a compute host to which specified VMs belong and the list of selected VMs which are running on this host.

    • Examine the line graphs illustrating the resources consumption. To view the values with their measures, hover over a required line.

      Note

      The y-axis may contain suffixes, such as K, M, G, and others. These suffixes correspond to prefixes of the units of measurement, such as Kilo, Mega, Giga, and so on depending on the measure.

  5. To scale graphs by y-axis:

    • Click Zoom In to set y-axis start to the lowest point of selected lines and the y-axis end to the highest point of selected lines.

    • Click Zoom Out to switch to the default view where y-axis starts at 0 and ends at the highest point of all the lines on a chart.

  6. To scale graphs by x-axis:

    • Click Expand to expand a graph to the full width.

    • Click Collapse to switch to the default view.

  7. To select and hide lines on graphs:

    • Click on an item under a graph or on a line itself to view only the selected line. Combine this action with Zoom In for the detailed view.

    • Hover over an item under a graph to highlight the related line and mute the others.

DriveTrain

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The DriveTrain dashboard provides access to a custom Jenkins interface. Operators can perform the following operations through the DevOps Portal UI:

  • View the list of Jenkins jobs by views with job names and last build statuses and descriptions.

  • View specific Jenkins job information including a list of builds with their statuses and descriptions as well as the stages for the last five builds.

  • Analyze a specific build information including stages, console output, and artifact list on a build information page.

  • View a job console output in real time and manage the build flow.

  • Execute Jenkins jobs with custom parameters and re-run builds.

To perform all the above operations, use the DriveTrain dashboard available in the DevOps Portal UI.

Cloud Capacity Management

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Cloud Capacity Management service provides point-in-time resource consumption data for OpenStack by displaying parameters such as total CPU utilization, memory utilization, disk utilization, and number of hypervisors. The related dashboard is based on data collected by the Cloud Intelligence service and can be used for cloud capacity management and other business optimization aspects.

Note

The Cloud Capacity Management service depends on the following services:

  • DevOps Portal web UI

  • Kibana service of StackLight LMA

Heatmaps

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The Heatmaps dashboard provides the information about resource consumption by the cloud environment identifying the load of each node and number of alerts triggered for each node. The dashboard includes heatmaps for the following data:

  • Memory utilization

  • CPU utilization

  • Disk utilization

  • Alerts triggered

Note

The Heatmaps dashboard depends on the following services:

  • DevOps Portal web UI

  • Prometheus service of StackLight LMA

To use the Heatmaps dashboard:

  1. Log in to the DevOps Portal.

  2. Navigate to the Heatmaps dashboard.

  3. Switch between the tabs to select a required heatmap.

    Each box on a heatmap represents a hypervisor. The box widget is color-coded:

    • Green represents a normal load or no alerts triggered

    • Orange represents a high load or low number of alerts

    • Red represents an overloaded node or a high number of alerts

  4. Specify the parameters for the data to be displayed:

    • Use Now, Last 5m, Last 15m, and Last 30m buttons to view data for a specific time period.

    • Use the Custom button to set a custom time period. The time value format includes a number from 1 to 99 and a unit suffix: m for minutes, h for hours, d for days, or w for weeks. For example, 12h, 3d, and so on.

    • Use the Max button to display a maximum value of resources consumption or number of alerts during the selected period of time.

    • Use the Avg button on the Memory, CPU, and Disk tabs to display an average value of resources consumption during the selected period of time.

    • Use the Diff button on the Alerts tab to display the count of alerts triggered since the selected period of time.

      Note

      On the Alerts tab, the 0 count of alerts means that either 0 alerts are triggered or Prometheus failed to receive the requested data for a specific node.

LMA

Warning

The DevOps Portal has been deprecated in the Q4`18 MCP release tagged with the 2019.2.0 Build ID.

The LMA tab provides access to the LMA (Logging, Monitoring, and Alerting) toolchain of the Mirantis Cloud Platform. More specifically, LMA includes:

  • The LMA > Logging tab to access Kibana

  • The LMA > Monitoring tab to access Grafana

  • The LMA > Alerting tab to access the Prometheus web UI

Note

The LMA tab is only available in the DevOps Portal with LMA deployments and depends on the following services:

  • DevOps Portal web UI

  • Prometheus, Grafana, and Kibana services of StackLight LMA

LMA Logging dashboard

The LMA Logging tab provides access to Kibana. Kibana is used for log and time series analytics and provides real-time visualization of the data stored in Elasticsearch.

To access the Kibana dashboards from the DevOps Portal:

  1. Log in to the DevOps Portal.

  2. Navigate to the LMA > Logging tab.

  3. Use Kibana as described in Manage Kibana dashboards and Use Kibana filters and queries.

LMA Monitoring dashboard

The LMA Monitoring tab provides access to the Grafana web service that builds and visually represents metric graphs based on time series databases. A collection of predefined Grafana dashboards contains graphs on particular monitoring endpoints.

To access the Grafana dashboards from the DevOps Portal:

  1. Log in to the DevOps Portal.

  2. Navigate to the LMA > Monitoring tab.

  3. Log in to Grafana.

  4. Select the required dashboard from the Home drop-down menu.

For information about the available Grafana dashboards, see View Grafana dashboards. To hide nodes from dashboards, see Hide nodes from dashboards.

LMA Alerting dashboard

The LMA Alerting tab provides access to the Prometheus web UI that enables you to view simple graphs, Prometheus configuration and rules, and states of the monitoring endpoints of your deployment.

To access the Prometheus web UI from the DevOps Portal:

  1. Log in to the DevOps Portal.

  2. Navigate to the LMA > Alerting tab.

  3. Use the upper navigation menu to view alerts, graphs, or statuses. See View graphs and alerts and View Prometheus settings for details.

StackLight LMA operations

Using StackLight LMA, the Logging, Monitoring, and Alerting toolchain of the Mirantis Cloud Platform, cloud operators can monitor OpenStack environments, Kubernetes clusters, and OpenContrail services deployed on the platform and be quickly notified of critical conditions that may occur in the system so that they can prevent service downtimes.

This section describes how to configure and use StackLight LMA.

Configure StackLight LMA components

Once you deploy StackLight LMA, you may need to modify its components. For example, you may need to configure the Prometheus database, define alerting rules, and so on. The configuration of StackLight LMA is stored in Reclass. Therefore, you must modify the Reclass model and re-execute the Salt formulas.

Configure Telegraf

The configuration of the Telegraf agent is stored in the telegraf section of the Reclass model.

To configure Telegraf:

  1. Log in to the Salt Master node.

  2. Configure the telegraf section in the classes/cluster/cluster_name/init.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@linux:system' state.sls telegraf
    

Example configuration:

telegraf:
  agent:
    enabled: true
    interval: 15
    round_interval: false
    metric_batch_size: 1000
    metric_buffer_limit: 10000
    collection_jitter: 2
    output:
      prometheus_client:
        bind:
          address: 0.0.0.0
          port: 9126
        engine: prometheus

In the example above, the Reclass model is converted to a configuration file recognized by Telegraf. For details about options, see the Telegraf documentation and the */meta/telegraf.yml file in every Salt formula.

The input and output YAML dictionaries contain a list of defined inputs and outputs for Telegraf. To add input or output parameters to Telegraf, use the same format as used in */meta/telegraf.yml of the required Salt formula. However, this should be performed only by deployment engineers or developers.
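
For example, a hypothetical sketch of adding an input, assuming that the formula accepts Telegraf plugin names as keys under the input dictionary (percpu and totalcpu are options of the upstream Telegraf cpu plugin):

telegraf:
  agent:
    input:
      cpu:
        percpu: false
        totalcpu: true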

Configure Prometheus

You may need to configure Prometheus, for example, to modify an existing alert. Prometheus configuration is stored in the prometheus:server section of the Reclass model.

To configure Prometheus:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required.

  3. Update the Salt mine:

    salt -C 'I@salt:minion' state.sls salt.minion.grains
    salt -C 'I@salt:minion' saltutil.refresh_modules
    salt -C 'I@salt:minion' mine.update
    
  4. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    

Example configuration:

prometheus:
  server:
    enabled: true
    bind:
      port: 9090
      address: 0.0.0.0
    storage:
      local:
        engine: "persisted"
        retention: "360h"
        memory_chunks: 1048576
        max_chunks_to_persist: 524288
        num_fingerprint_mutexes: 4096
    alertmanager:
      notification_queue_capacity: 10000
    config:
      global:
        scrape_interval: "15s"
        scrape_timeout: "15s"
        evaluation_interval: "1m"
        external_labels:
          region: 'region1'
    alert:
      PrometheusTargetDownKubernetesNodes:
        if: 'up{job="kubernetes-nodes"} != 1'
        labels:
          severity: down
          service: prometheus
        annotations:
          summary: 'Prometheus target down'

The following settings are available:

  • storage - the storage YAML dictionary stores the configuration options for the Prometheus storage database. These options are passed to the Prometheus server through the command-line arguments.

  • config - the config YAML dictionary contains the options that will be placed in the Prometheus configuration file. For more information, see the Prometheus configuration documentation.

  • alert - the alert YAML dictionary is used to generate Prometheus alerting rules. For more information, see Alerting rules.

Alternatively, you can import alerts from the */meta/prometheus.yml file of any Salt formula. However, this should be performed only by deployment engineers or developers.

Caution

The Prometheus data directory is mounted from the Docker host. If you restart a container, it can be spawned on a different host. This can cause Prometheus to start with an empty storage. In such a case, the data is still available on the previous host.

See also

Manage alerts

Configure Prometheus long-term storage

You may need to configure Prometheus long-term storage to change the external labels, scrape intervals and timeouts, and so on. Since Prometheus long-term storage and Prometheus Relay are connected, you can use the same configuration file to modify Prometheus Relay, for example, to change the bind port. The configuration of Prometheus long-term storage and Prometheus Relay is stored in the prometheus:server and prometheus:relay sections of the Reclass model.

To configure Prometheus long-term storage and Prometheus Relay:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server and prometheus:relay sections in the classes/cluster/<cluster_name>/stacklight/telemetry.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@prometheus:relay' state.sls prometheus
    

Example configuration of Prometheus long-term storage:

prometheus:
 server:
   dir:
     config: /etc/prometheus
     data: /var/lib/prometheus/data
   bind:
     port: 9090
     address: 0.0.0.0
   storage:
     local:
       retention: 4320h
   config:
     global:
       scrape_interval: 30s
       scrape_timeout: 30s
       evaluation_interval: 15s
       external_labels:
         region: region1

Example configuration of Prometheus Relay:

prometheus:
 relay:
   enabled: true
   bind:
     port: 8080
   client:
     timeout: 12

Note

Configuring the timeout for Prometheus Relay is supported starting from the MCP 2019.2.4 maintenance update. To obtain the feature, follow the steps described in Apply maintenance updates.

Configure Alertmanager

The configuration of Alertmanager is stored in the prometheus:alertmanager section of the Reclass model. For available configuration settings, see the Alertmanager documentation.

To configure Alertmanager:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:alertmanager section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt formula:

    salt -C 'I@docker:swarm:role:master and I@prometheus:server' state.sls prometheus.alertmanager
    

Example configuration:

prometheus:
  alertmanager:
    enabled: true
    bind:
      address: 0.0.0.0
      port: 9093
    config:
      global:
        resolve_timeout: 5m
      route:
        group_by: ['alertname', 'region', 'service']
        group_wait: 60s
        group_interval: 5m
        repeat_interval: 3h
        receiver: HTTP-notification
      inhibit_rules:
        - source_match:
            severity: 'down'
          target_match:
            severity: 'critical'
          equal: ['region', 'service']
        - source_match:
            severity: 'down'
          target_match:
            severity: 'warning'
          equal: ['region', 'service']
        - source_match:
            severity: 'critical'
          target_match:
            severity: 'warning'
          equal: ['alertname', 'region', 'service']
      receivers:
        - name: 'HTTP-notification'
          webhook_configs:
            - url: http://127.0.0.1
              send_resolved: true

Configure the logging system components

The logging system components include Fluentd (log collector), Elasticsearch, Elasticsearch Curator, and Kibana. You can modify the Reclass model to configure the logging system components. For example, you can configure Fluentd to gather logs from a custom entity.

Configure Fluentd

Fluentd gathers system and service logs and pushes them to the default output destination such as Elasticsearch, file, and so on. You can configure Fluentd to gather logs from custom entities, remove the default entities from the existing Fluentd configuration, as well as to filter and route logs. Additionally, you can configure Fluentd to expose metrics generated from logs to Prometheus.

Configure logs gathering

You can configure Fluentd to gather logs from custom entities, remove the default entities from the existing Fluentd configuration, as well as to filter and route logs. During configuration, you can define the following parameters:

  • input, to gather logs from external sources such as a log file, TCP socket, and so on. For details, see Input plugin overview.

  • filter, to filter the log entries gathered by the Input plugin. For example, to add, change, or remove fields. For details, see Filter plugin overview.

  • match, to push final log entries to a given destination such as Elasticsearch, file, and so on. For details, see Output plugin overview.

  • label, to connect the inputs. Logs gathered by Fluentd are processed from top-to-bottom of a configuration file. The label parameter connects the inputs, filters, and matches into a single flow. Using the label parameter ensures that filters for a given label are defined after input and before match.

Note

Perform all changes in the Reclass model. Add the custom log parsing rules used by a single environment to the cluster model. Place the log parsing rules intended for all deployments in the /meta/ directory of the corresponding Salt formula. For details, see the */meta/fluentd.yml file of the required Salt formula.

To configure logs gathering:

  1. Log in to the Salt Master node.

  2. On the cluster level, specify the following snippets in the Reclass model for a particular node as required:

    • To add a new input:

      fluentd:
        agent:
          config:
            input:
              file_name:
                input_name:
                  parameterA: 10
                  parameterB: C
                input_nameB:
                  parameterC: ABC
      
    • To add a new filter:

      fluentd:
        agent:
          config:
            filter:
              file_name:
                filter_name:
                  parameterA: 10
                  parameterB: C
                filter_nameB:
                  parameterC: ABC
      
    • To add a new match:

      fluentd:
        agent:
          config:
            match:
              file_name:
                match_name:
                  parameterA: 10
                  parameterB: C
                match_nameB:
                  parameterC: ABC
      
    • If the service requires more advanced processing than gathering logs from an external source (input), add a label. For example, if you want to add filtering, use the label parameter, which defines the whole flow. All entries in label are optional, so you can define filter and match but skip input.

      fluentd:
        agent:
          config:
            label:
              label_name:
                input:
                  input1:
                    parameter1: abc
                  input2:
                    parameter1: abc
                filter:
                  filter1:
                    parameter1: abc
                    parameter2: abc
                match:
                  match1:
                    parameter1: abc
      

      Example:

      fluentd:
        agent:
          config:
            label:
              docker:
                input:
                  container:
                    type: tail
                    tag: temp.docker.container.*
                    path: /var/lib/docker/containers/*/*-json.log
                    path_key: log_path
                    pos_file: {{ positiondb }}/docker.container.pos
                    parser:
                      type: json
                      time_format: '%Y-%m-%dT%H:%M:%S.%NZ'
                      keep_time_key: false
                filter:
                  enrich:
                    tag: 'temp.docker.container.**'
                    type: record_transformer
                    enable_ruby: true
                    record:
                      - name: severity_label
                        value: INFO
                      - name: Severity
                        value: 6
                      - name: programname
                        value: docker
                match:
                  cast_service_tag:
                    tag: 'temp.docker.container.**'
                    type: rewrite_tag_filter
                    rule:
                      - name: log_path
                        regexp: '^.*\/(.*)-json\.log$'
                        result: docker.container.$1
                  push_to_default:
                    tag: 'docker.container.*'
                    type: relabel
                    label: default_output
      
    • To forward the logs gathered from a custom service to the default output, change the final match statement to default_output.

      Example:

      fluentd:
        agent:
          config:
            label:
              custom_daemon:
                input:
                  
                match:
                  push_to_default:
                    tag: 'some.tag'
                    type: relabel
                    label: default_output
      

      Note

      The default output is defined in the system Reclass model. For details, see Default output. All Fluentd labels defined in /meta/ must use this mechanism to ensure log forwarding to the default output destination.

    • To disable input, filter, match, or label, specify enabled: false for the required Fluentd entity.

      Example:

      fluentd:
        agent:
          config:
            label:
              docker:
                enabled: false
      
  3. Apply the following state:

    salt -C 'node_name' state.sls fluentd
    
Add an additional output for Fluentd

If you have a syslog server and want StackLight LMA to send logs to this server, configure an additional output for Fluentd. In this case, Fluentd will push logs both to your syslog server and to Elasticsearch, which is the default target.

To add an additional output for Fluentd:

  1. Download and install the td-agent-additional-plugins package on every host that runs Fluentd:

    apt-get install --only-upgrade td-agent-additional-plugins
    
  2. Open your Git project repository with the Reclass model on the cluster level.

  3. In the classes/cluster/<cluster_name>/init.yml file, perform the following changes:

    1. Comment out the system.fluentd.label.default_output.elasticsearch class.

    2. Copy the default_output parameters and rename the section to elasticsearch_output.

    3. To apply the existing filters for all outputs, copy the default output filter section to the new default output.

    4. Add syslog_output and specify the parameters as required.

    Example:

    classes:
    - system.fluentd
    - system.fluentd.label.default_metric
    - system.fluentd.label.default_metric.prometheus
    ## commented out
    #- system.fluentd.label.default_output.elasticsearch
    - system.fluentd.label.default_output.syslog
    parameters:
      fluentd:
        agent:
          plugin:
            fluent-plugin-remote_syslog:
              deb: ['td-agent-additional-plugins']
          config:
            label:
              ## renamed previous default_output -> elasticsearch_output
              elasticsearch_output:
                match:
                  elasticsearch_output:
                    tag: "**"
                    type: elasticsearch
                    host: ${_param:fluentd_elasticsearch_host}
                    port: ${_param:elasticsearch_port}
              syslog_output:
                match:
                  syslog_output:
                    tag: "**"
                    type: syslog
                    host: 127.0.0.1
                    port: 514
                    ## optional params:
                    # format: xxx
                    # severity: xxx
                    # facility: xxx
                    # protocol: xxx
                    # tls: xxx
                    # ca_file: xxx
                    # verify_mode: xxx
                    # packet_size: xxx
                    # timeout: xxx
                    # timeout_exception: xxx
                    # keep_alive: xxx
                    # keep_alive_idle: xxx
                    # keep_alive_cnt: xxx
                    # keep_alive_intvl: xxx
              default_output:
                 ## copy of previous default_output filter section
                # filter: {}
                match:
                  send_to_default:
                    tag: "**"
                    type: copy
                    store:
                      - type: relabel
                        label: syslog_output
                      - type: relabel
                        label: elasticsearch_output
    
  4. Log in to the Salt Master node.

  5. Synchronize Salt modules and refresh Salt pillars:

    salt '*' saltutil.sync_all
    salt '*' saltutil.refresh_pillar
    
  6. Apply the following state:

    salt '*' state.sls fluentd
    
Enable sending CADF events to external SIEM systems

Note

This feature is available starting from the MCP 2019.2.4 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

You can configure Fluentd running on the RabbitMQ nodes to forward the Cloud Auditing Data Federation (CADF) events to specific external security information and event management (SIEM) systems, such as Splunk, ArcSight, or QRadar. The procedure below provides a configuration example for Splunk.

To enable sending CADF events to Splunk:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/stacklight, create a custom notification channel. For example, create a fluentd_splunk.yml file with the following pillar, specifying the hosts and ports in the splunk_output and syslog_output parameters:

    parameters:
      fluentd:
        agent:
          config:
            label:
              audit_messages:
                filter:
                  get_payload_values:
                    tag: audit
                    type: record_transformer
                    enable_ruby: true
                    record:
                      - name: Logger
                        value: ${fluentd:dollar}{ record.dig("publisher_id") }
                      - name: Severity
                        value: ${fluentd:dollar}{ {'TRACE'=>7,'DEBUG'=>7,'INFO'=>6,\
                        'AUDIT'=>6,'WARNING'=>4,'ERROR'=>3,'CRITICAL'=>2}\
                        [record['priority']].to_i }
                      - name: Timestamp
                        value: ${fluentd:dollar}{ DateTime.strptime(record.dig\
                        ("payload", "eventTime"), "%Y-%m-%dT%H:%M:%S.%N%z").strftime\
                        ("%Y-%m-%dT%H:%M:%S.%3NZ") }
                      - name: notification_type
                        value: ${fluentd:dollar}{ record.dig("event_type") }
                      - name: severity_label
                        value: ${fluentd:dollar}{ record.dig("priority") }
                      - name: environment_label
                        value: ${_param:cluster_domain}
                      - name: action
                        value: ${fluentd:dollar}{ record.dig("payload", "action") }
                      - name: event_type
                        value: ${fluentd:dollar}{ record.dig("payload", "eventType") }
                      - name: outcome
                        value: ${fluentd:dollar}{ record.dig("payload", "outcome") }
                  pack_payload_to_json:
                    tag: audit
                    require:
                      - get_payload_values
                    type: record_transformer
                    enable_ruby: true
                    remove_keys: '["payload", "timestamp", "publisher_id", "priority"]'
                    record:
                      - name: Payload
                        value: ${fluentd:dollar}{ record["payload"].to_json }
                match:
                  send_to_default:
                    tag: "**"
                    type: copy
                    store:
                      - type: relabel
                        label: splunk_output
                      - type: relabel
                        label: syslog_output
              splunk_output:
                match:
                  splunk_output:
                    tag: "**"
                    type: splunk_hec
                    host: <splunk_host>
                    port: <splunk_port>
                    token: <splunk_token>
              syslog_output:
                match:
                  syslog_output:
                    tag: "**"
                    type: syslog
                    host: <syslog_host>
                    port: <syslog_port>
    
  3. In openstack/message_queue.yml:

    1. Replace the system.fluentd.notifications class with the following ones:

      classes:
      - system.fluentd.label.notifications.input_rabbitmq
      - system.fluentd.label.notifications.notifications
      
    2. Add the custom Fluentd channel as required. For example:

      cluster.<cluster_name>.stacklight.fluentd_splunk
      
  4. Log in to the Salt Master node.

  5. Apply the fluentd state on the msg nodes:

    salt -C 'I@rabbitmq:server' state.sls fluentd
    
Enable Fluentd to expose metrics generated from logs

You can enable exposing metrics that are generated from log events. This allows monitoring of various activities, such as disk failures (the hdd_errors_total metric). By default, Fluentd generates metrics from the logs it gathers. However, you must configure Fluentd to expose such metrics to Prometheus. Prometheus scrapes the Fluentd metrics through a static endpoint. For details, see Add a custom monitoring endpoint. To generate metrics from logs, StackLight LMA uses the fluent-plugin-prometheus plugin.
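
For reference, the corresponding static endpoint in the prometheus:server section might look as follows. This is only a sketch: the address is a placeholder and port 24231 is assumed to be the default listening port of the fluent-plugin-prometheus plugin; adjust both to your deployment.

prometheus:
  server:
    target:
      static:
        fluentd:
          endpoint:
            - address: 1.1.1.1   # placeholder address of a node running Fluentd
              port: 24231        # assumed fluent-plugin-prometheus listening port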

To configure Fluentd to expose metrics generated from logs:

  1. Log in to the Salt Master node.

  2. Add the following class to the cluster/<cluster_name>/init.yml file of the Reclass model:

    system.fluentd.label.default_metric.prometheus
    

    This class creates a new label default_metric that is used as a generic interface to expose new metrics to Prometheus.

  3. (Optional) Create a filter for metric.metric_name to generate the metric.

    Example:

    fluentd:
      agent:
        label:
          default_metric:
            filter:
              metric_out_of_memory:
                tag: metric.out_of_memory
                type: prometheus
                metric:
                  - name: out_of_memory_total
                    type: counter
                    desc: The total number of OOM.
                label:
                  - name: host
                    value: ${Hostname}
              metric_hdd_errors_parse:
                tag: metric.hdd_errors
                type: parser
                key_name: Payload
                parser:
                  type: regexp
                  format: '/(?<device>[sv]d[a-z]+\d*)/'
              metric_hdd_errors:
                tag: metric.hdd_errors
                require:
                  - metric_hdd_errors_parse
                type: prometheus
                metric:
                  - name: hdd_errors_total
                    type: counter
                    desc: The total number of hdd errors.
                label:
                  - name: host
                    value: ${Hostname}
                  - name: device
                    value: ${device}
          systemd:
            output:
              push_to_default:
                tag: '*.systemd'
                type: copy
                store:
                  - type: relabel
                    label: default_output
                  - type: rewrite_tag_filter
                    rule:
                      - name: Payload
                        regexp: '^Out of memory'
                        result: metric.out_of_memory
                      - name: Payload
                        regexp: >-
                          'error.+[sv]d[a-z]+\d*'
                        result: metric.hdd_errors
                      - name: Payload
                        regexp: >-
                          '[sv]d[a-z]+\d*.+error'
                        result: metric.hdd_errors
              push_to_metric:
                tag: 'metric.**'
                type: relabel
                label: default_metric
    
Configure log rotation

Fluentd uses two options to modify log file rotation: the logrotate parameter, which controls log rotation on a daily basis, and the internal td_agent_log_rotate_size parameter, which rotates log files by size and is set to 10 MB by default. If a log file exceeds this limit, the internal log rotation service of Fluentd rotates it. You can modify td_agent_log_rotate_size if required, as shown in the example below.
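
For example, to rotate the td-agent log at 50 MB instead of the default 10 MB, you would set the parameter to 52428800 bytes. The value is purely illustrative; pick a size that matches your disk capacity and logging volume.

parameters:
  fluentd:
    agent:
      td_agent_log_rotate_size: 52428800  # 50 MB, illustrative value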

To configure log rotation:

  1. Log in to the Salt Master node.

  2. Specify the following parameter in the cluster/<cluster_name>/init.yml file of the Reclass model:

    parameters:
      fluentd:
        agent:
          td_agent_log_rotate_size: <custom_value_in_bytes>
    
  3. Apply the following state:

    salt -C 'I@fluentd:agent' state.sls fluentd
    
Configure Elasticsearch

The configuration parameters of Elasticsearch are defined in the corresponding sections of the Reclass model.

To configure Elasticsearch:

  1. Log in to the Salt Master node.

  2. Configure the parameters:elasticsearch section in the classes/cluster/<cluster_name>/stacklight/log.yml file of the Reclass model as required. For example, to limit the heap size, specify the following snippet:

    parameters:
      elasticsearch:
        server:
          heap:
            size: 31
    
  3. Apply the Salt state:

    salt -C 'I@elasticsearch:server' state.sls elasticsearch
    

For configuration examples, see the README.rst at Elasticsearch Salt formula.

Configure Elasticsearch Curator

The Elasticsearch Curator tool manages the data (indices) and the data retention policy in Elasticsearch clusters. You can modify the indices and the retention policy.

To configure Elasticsearch Curator:

  1. Open your Reclass model Git repository on the cluster level.

  2. Modify the classes/cluster/<cluster_name>/stacklight/log_curator.yml file as required (see the example after this procedure):

    • To configure indices, set the required prefixes using the elasticsearch_curator_indices_pattern parameter. The default value is "^(log|audit)-.*$", meaning that Curator manages the indices with log- and audit- prefixes.

    • To configure the retention policy for logs and audit indices, specify the elasticsearch_curator_retention_period parameter. The retention period is set to 31 days by default.

    • To configure the retention policy for notification indices, specify the elasticsearch_curator_notifications_retention_period parameter. The retention period is set to 90 days by default.

  3. Log in to the Salt Master node.

  4. Apply the following state:

    salt -C 'I@elasticsearch:server' state.sls_id elasticsearch_curator_action_config elasticsearch
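
The following snippet illustrates these parameters with their default values, assuming they are defined under _param in log_curator.yml as is the convention elsewhere in this guide. Adjust the values to your retention requirements.

parameters:
  _param:
    elasticsearch_curator_indices_pattern: "^(log|audit)-.*$"
    elasticsearch_curator_retention_period: 31
    elasticsearch_curator_notifications_retention_period: 90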
    
Configure Kibana

The configuration parameters of Kibana are defined in the corresponding sections of the Reclass model.

To configure Kibana:

  1. Log in to the Salt Master node.

  2. Configure the parameters:kibana section in the classes/cluster/<cluster_name>/stacklight/server.yml of the Reclass model as required.

  3. Apply the Salt state:

    salt -C 'I@kibana:server' state.sls kibana
    

For configuration examples, see the README.rst at Kibana Salt formula.
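
For instance, a minimal sketch of the parameters:kibana section might look as follows. The structure is an assumption based on the bind address and port layout used by the other StackLight components in this guide, and 5601 is the standard Kibana port; verify the exact pillar layout against the formula README before using it.

parameters:
  kibana:
    server:
      enabled: true
      bind:
        address: 0.0.0.0  # assumed bind layout, verify against the Kibana Salt formula
        port: 5601        # standard Kibana port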

Configure Grafana

The configuration of Grafana is stored in the grafana section of the Reclass model.

To configure Grafana:

  1. Log in to the Salt Master node.

  2. Configure the grafana section in the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt formulas:

    salt -C 'I@grafana:server' state.sls grafana.server
    salt -C 'I@grafana:client' state.sls grafana.client
    

Example configuration:

grafana:
  server:
    enabled: true
    bind:
      address: 127.0.0.1
      port: 3000
    database:
      engine: mysql
      host: 127.0.0.1
      port: 3306
      name: grafana
      user: grafana
      password: db_pass
    auth:
      basic:
        enabled: true
    admin:
      user: admin
      password: admin_pass
    dashboards:
      enabled: false
      path: /var/lib/grafana/dashboards

Configure InfluxDB

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The configuration of InfluxDB is stored in the parameters:influxdb section of the Reclass model.

To configure InfluxDB:

  1. Log in to the Salt Master node.

  2. Configure the parameters:influxdb section in the classes/cluster/<cluster_name>/stacklight/server.yml file of the Reclass model as required.

  3. Apply the Salt state:

    salt -C 'I@influxdb:server' state.sls influxdb
    

For configuration examples, see the README.rst at InfluxDB Salt formula.

Enable Docker garbage collection

To avoid unused Docker images and volumes consuming the entire disk space, you can enable a clean-up cron job for old StackLight LMA containers and volumes. By default, the cron job runs daily at 6:00 a.m. and cleans stopped StackLight LMA images and containers that are older than one week.

To enable Docker garbage collection:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/stacklight/server.yml, specify the following parameter:

    _param:
      docker_garbage_collection_enabled: true
    
  3. Optional. To change the default parameters, use:

    linux:
      system:
        cron:
          user:
            root:
              enabled: true
        job:
          docker_garbage_collection:
            command: docker system prune -f --filter until=$(date +%s -d "1 week ago")
            enabled: ${_param:docker_garbage_collection_enabled}
            user: root
            hour: 6
            minute: 0
    
  4. Log in to the Salt Master node.

  5. Apply the following state:

    salt -C 'I@docker:swarm' state.sls linux.system.cron
    

Configure authentication for Prometheus and Alertmanager

Note

This feature is available starting from the MCP 2019.2.7 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

You can configure basic authentication for access to the Prometheus and Alertmanager web UIs through the proxy nodes, which are available if external access to cloud resources is enabled in your OpenStack deployment.

This section describes how to configure authentication for Prometheus and Alertmanager by replacing the default passwords on existing MCP deployments updated to 2019.2.7. For new clusters deployed starting from the MCP 2019.2.7 maintenance update, you define custom credentials for Prometheus and Alertmanager during the cluster creation.

To configure authentication for Prometheus and Alertmanager:

  1. Log in to the Salt Master node.

  2. Obtain user names and passwords:

    1. For Alertmanager:

      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_password
      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_alertmanager_user
      
    2. For Prometheus:

      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_user
      salt -C 'I@horizon:server' pillar.get _param:nginx_proxy_prometheus_server_password
      
  3. Change the default credentials for Prometheus and Alertmanager:

    1. Open the classes/cluster/<cluster_name>/stacklight/proxy.yml file for editing.

    2. Specify new passwords using the following parameters:

      parameters:
        _param:
           nginx_proxy_prometheus_alertmanager_password: <password>
           nginx_proxy_prometheus_server_password: <password>
      
    3. Optional. Specify new user names using the following parameters:

      parameters:
        _param:
           nginx_proxy_prometheus_alertmanager_user: <user_name>
           nginx_proxy_prometheus_server_user: <user_name>
      
  4. On all proxy nodes, synchronize Salt modules and apply the nginx state. For example:

    salt 'prxNode01*' saltutil.sync_all
    salt 'prxNode01*' state.sls nginx
    
  5. Verify authentication through the proxy nodes. For example:

    salt 'prxNode01*' pillar.get _param:cluster_vip_address
    salt 'prxNode01*' pillar.get nginx:server:site:nginx_proxy_prometheus_server:proxy:port
    salt 'prxNode01*' pillar.get nginx:server:site:nginx_proxy_prometheus_alertmanager:proxy:port
    curl https://<cluster_vip_address>:<prometheus_server_port>
    curl -u <username>:<password> https://<cluster_vip_address>:<prometheus_server_port>
    

Restart StackLight LMA components

You may need to restart one of the StackLight LMA components, for example, if its service hangs.

Restart services running in Docker Swarm

The Prometheus, Alertmanager, Alerta, Pushgateway, and Grafana services run in Docker Swarm mode. This section describes how to restart these services.

To restart services running in Docker Swarm:

  1. Log in to the Salt Master node.

  2. Issue one of the following commands depending on the service you want to restart:

    • To restart Prometheus:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_server --force"
      
    • To restart Alertmanager:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_alertmanager --force"
      
    • To restart Alerta:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_alerta --force"
      
    • To restart Pushgateway:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_pushgateway --force"
      
    • To restart Grafana:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update dashboard_grafana --force"
      
    • To restart Prometheus Relay:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_relay --force"
      
    • To restart Prometheus Remote Agent:

      salt -C 'I@docker:swarm:role:master and I@prometheus:server' cmd.run \
      "docker service update monitoring_remote_agent --force"
      

Restart the logging system components

The logging system components include Fluentd, Elasticsearch, and Kibana. If required, you can restart these components.

To restart Fluentd

The Fluentd process that is responsible for collecting logs is td-agent (Treasure Data Agent). The td-agent process starts automatically when the fluentd Salt state is applied. To manually start and stop it, use the Salt commands.

The following example shows how to restart the td-agent process from the Salt Master node on all Salt Minion nodes with names that start with ctl:

# salt 'ctl*' service.restart td-agent

Alternatively, SSH to the node and use the service manager (systemd or upstart) to restart the service. For example:

# ssh ctl01.mcp-lab-advanced.local
# service td-agent restart

See the salt.modules.service documentation for more information on how to use the Salt service execution module.

To restart Elasticsearch

Run the following command from the Salt Master node:

# salt 'log*' service.restart elasticsearch

To restart Kibana

Run the following command from the Salt Master node:

# salt 'log*' service.restart kibana

Restart Telegraf

The Telegraf service is called telegraf.

To restart Telegraf on all nodes:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@telegraf:agent' service.restart telegraf
    

Restart InfluxDB

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The InfluxDB service is called influxdb.

To restart InfluxDB on all nodes:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@influxdb:server' service.restart influxdb -b 1
    

Restart InfluxDB Relay

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4`18 MCP release and will be removed in the next release.

The InfluxDB Relay service is called influxdb-relay.

To restart InfluxDB Relay:

  1. Log in to the Salt Master node.

  2. Run the following command:

    salt -C 'I@influxdb:server' service.restart influxdb-relay -b 1
    

Restart Prometheus Relay and Prometheus long-term storage

You can restart Prometheus Relay and Prometheus long-term storage, for example, if one of the services hangs. Since these services are connected, you must restart both.

To restart Prometheus Relay and Prometheus long-term storage:

  1. Log in to the Salt Master node.

  2. Run the following commands:

    salt -C 'I@prometheus:relay' service.restart prometheus
    salt -C 'I@prometheus:relay' service.restart prometheus-relay
    

Manage endpoints, metrics, and alerts

You can easily configure StackLight LMA to support new monitoring endpoints, add custom metrics and alerts, and modify or disable the existing alerts.

Add a custom monitoring endpoint

If required, you can add a custom monitoring endpoint to Prometheus, such as Calico, etcd, or Telegraf.

To add a custom monitoring endpoint:

  1. Log in to the Salt Master node.

  2. Configure the prometheus:server section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model as required. Add the monitoring endpoint IP and port.

    Example:

    prometheus:
      server:
        target:
          static:
            endpoint_name:
              endpoint:
                - address: 1.1.1.1
                  port: 10
                - address: 2.2.2.2
                  port: 10
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    

Add a custom metric

If required, you can add a custom metric, for example, to monitor a third-party software. This section describes how to add a custom metric to Telegraf.

Note

To add a custom metric to a new endpoint, you must first add the new endpoint to Prometheus as described in Add a custom monitoring endpoint.

To add a custom metric to Telegraf:

  1. Log in to the Salt Master node.

  2. Edit the telegraf section in the classes/cluster/cluster_name/init.yml file of the Reclass model as required.

    Example:

    telegraf:
      agent:
        input:
          procstat:
            process:
              memcached:
                exe: memcached
          memcached:
            servers:
              - address: {{ server.bind.address | replace("0.0.0.0", "127.0.0.1") }}
                port: {{ server.bind.port }}
    
  3. Apply the Telegraf Salt formula:

    salt -C 'I@linux:system' state.sls telegraf
    

Manage alerts

You can easily extend StackLight LMA to support a new service check by adding a custom alert. You may also need to modify or disable the default alerts as required.

To create a custom alert:

  1. Log in to the Salt Master node.

  2. Add the new alert to the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model. Enter the alert name, alerting conditions, severity level, and annotations that will be shown in the alert message.

    Example:

    prometheus:
      server:
        alert:
          EtcdFailedTotalIn5m:
            if: >-
              sum by(method) (rate(etcd_http_failed_total{code!~"4[0-9]{2}"}[5m]))
              / sum by(method) (rate(etcd_http_received_total[5m])) > {{
              prometheus_server.get('alert', {}).get('EtcdFailedTotalIn5m', \
              {}).get('var', {}).get('threshold', 0.01) }}
            labels:
              severity: warning
              service: etcd
            annotations:
              summary: 'High number of HTTP requests are failing on etcd'
              description: '{{ $value }}% of requests for {{ $labels.method }} \
              failed on etcd instance {{ $labels.instance }}'
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. To view the new alert, see the Prometheus logs:

    docker service logs monitoring_server
    

    Alternatively, see the Alerts tab of the Prometheus web UI.

To modify a default alert:

  1. Log in to the Salt Master node.

  2. Modify the required alert in the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model.

  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. To view the changes, see the Prometheus logs:

    docker service logs monitoring_server
    

    Alternatively, see the alert details in the Alerts tab of the Prometheus web UI.

To disable an alert:

  1. Log in to the Salt Master node.

  2. Create the required alert definition in the prometheus:server:alert section in the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model and set the enabled parameter to false.

    Example:

    prometheus:
      server:
        alert:
          EtcdClusterSmall:
            enabled: false
    
  3. Apply the Salt formula:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    
  4. Verify the changes in the Alerts tab of the Prometheus web UI.

Configure StackLight LMA to send notifications

To enable StackLight LMA to send notifications, you must modify the Reclass model. By default, email notifications will be sent. However, you can configure StackLight LMA to send notifications to Salesforce, Slack, or other receivers. Additionally, you can specify a notification receiver for a particular alert or alert type, or disable notifications.

Enable Alertmanager notifications

To enable StackLight LMA to send notifications, you must modify the Reclass model. By default, email notifications will be sent. However, you can configure StackLight LMA to send notifications to Salesforce, Slack, or other notifications receivers. You can also specify a notification channel for a particular alert or disable notifications.

Enable email or Slack notifications

This section describes how to enable StackLight LMA to send notifications to email, Slack, or to both notification channels using the Alertmanager service on an existing MCP cluster. By default, StackLight LMA uses Alertmanager and the SMTP protocol or the webhook receiver to send email or Slack notifications respectively.

Note

Skip this section if you require only email notifications and have already defined the variables for Alertmanager email notifications during the deployment model creation as described in MCP Deployment Guide: Infrastructure related parameters: Alertmanager email notifications.

To enable StackLight LMA to send notifications through Alertmanager:

  1. Log in to the Salt Master node.

  2. Open the classes/cluster/cluster_name/stacklight/server.yml file of the Reclass model for editing.

  3. Add the following classes:

    • For email notifications:

      classes:
      [...]
      - system.prometheus.alertmanager.notification.email
      - system.prometheus.server.alert.labels_add.route
      
    • For Slack notifications:

      classes:
      [...]
      - system.prometheus.alertmanager.notification.slack
      - system.prometheus.server.alert.labels_add.route
      
  4. Define the following variables:

    • For email notifications:

      parameters:
        _param:
          alertmanager_notification_email_from: <email_from>
          alertmanager_notification_email_host: <smtp_server:port>
          alertmanager_notification_email_password: <email_password>
          alertmanager_notification_email_require_tls: <email_require_tls>
          alertmanager_notification_email_to: <email_to>
          alertmanager_notification_email_username: <email_username>
      

      Note

      Using the alertmanager_notification_email_host parameter, specify both the host and the port number of the SMTP server. For example, host.com:25.

    • For Slack notifications:

      parameters:
        _param:
          alertmanager_notification_slack_api_url: https://hooks.slack.com/services/<webhook/integration/token>
      
  5. Set one or multiple notification channels by using the _param:prometheus_server_alert_label_route parameter. The default value is email, which means that email notifications will be sent.

    Example:

    parameters:
      _param:
        prometheus_server_alert_label_route: email;slack;
    
  6. Apply the Salt formulas:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    
Enable Salesforce notifications

This section describes how to enable StackLight LMA to create Salesforce cases from Prometheus alerts on an existing cluster. StackLight LMA uses Alertmanager and the Salesforce notifier service to create the Salesforce cases.

Note

Skip this section if you have already defined the variables for Alertmanager Salesforce notifications during the deployment model creation as described in MCP Deployment Guide: General deployment parameters and MCP Deployment Guide: Infrastructure related parameters: Alertmanager Salesforce notifications.

If you configured Salesforce notifications through the Push Notification service, first proceed to Switch to Alertmanager-based notifications.

To enable StackLight LMA to send Salesforce notifications through Alertmanager:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/stacklight/client.yml, specify:

    classes:
    - system.docker.swarm.stack.monitoring.sf_notifier
    
    [...]
    
    parameters:
      _param:
        docker_image_sf_notifier: "${_param:mcp_docker_registry}/openstack-docker/sf_notifier:${_param:mcp_version}"
    
  3. In classes/cluster/<cluster_name>/stacklight/server.yml, specify:

    classes:
    - system.prometheus.alertmanager.notification.salesforce
    - system.prometheus.sf_notifier.container
    
    [...]
    
    parameters:
      _param:
        sf_notifier_sfdc_auth_url: "<salesforce_instance_http_endpoint>"
        sf_notifier_sfdc_username: "<customer_account_email>"
        sf_notifier_sfdc_password: "<customer_account_password>"
        sf_notifier_sfdc_organization_id: "<organization_id>"
        sf_notifier_sfdc_environment_id: "<cloud_id>"
        sf_notifier_sfdc_sandbox_enabled: "True/False"
    

    Warning

    If you have previously configured email notifications through Alertmanager, verify that the prometheus_server_alert_label_route parameter in server.yml includes not only the email but also salesforce values.

  4. Log in to the Salt Master node.

  5. Refresh Salt pillars:

    salt '*' saltutil.refresh_pillar
    
  6. Create the directory structure for the Salesforce notifier service:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.sf_notifier
    
  7. Start the sf-notifier service in Docker container:

    salt -C 'I@docker:swarm:role:master' state.sls docker.client
    
  8. Update the Prometheus configuration to create metrics target and alerts:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b 1
    
  9. Update the Alertmanager configuration to create the webhook receiver:

    salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.alertmanager -b 1
    
Configure Alertmanager integrations

Note

This feature is available starting from the MCP 2019.2.9 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes how to enable StackLight LMA to send notifications to specific receivers such as PagerDuty, OpsGenie, and so on using the Alertmanager service on an existing MCP cluster. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

The changes performed within the following procedure are backward compatible with previous MCP releases. Therefore, you do not need to change any of the already configured Alertmanager receivers and routes.

To enable Alertmanager integrations:

  1. Open your project Git repository with the Reclass model on the cluster level.

  2. Open the classes/cluster/<cluster_name>/stacklight/server.yml file for editing.

  3. Configure the Alertmanager receiver route as required (see the illustrative example after the following template):

    parameters:
      prometheus:
        alertmanager:
          enabled: true
          config:
            route:
              routes:
                <route_name>:
                  receiver: <receiver_name>
                  match_re:
                    - label: '<name_of_the_alert_label>'
                      value: '<regex_to_identify_the_route>'
                  continue: true
            receiver:
              <receiver_name>:
                enabled: true
                generic_configs:  # <- here is the difference
                  <chosen_Alertmanager_receiver_type>:
                    <receiver_endpoint_name>:
                      <receiver_config>
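
For illustration, a hypothetical configuration that routes alerts labeled severity: critical to a webhook receiver could look as follows. The route name, receiver name, and URL are assumptions made for this sketch; webhook_configs is one of the receiver types listed in the Alertmanager documentation, and the url and send_resolved settings follow the webhook configuration shown earlier in this guide.

parameters:
  prometheus:
    alertmanager:
      enabled: true
      config:
        route:
          routes:
            critical_webhook:            # assumed route name
              receiver: HTTP-critical
              match_re:
                - label: 'severity'
                  value: 'critical'
              continue: true
        receiver:
          HTTP-critical:                 # assumed receiver name
            enabled: true
            generic_configs:
              webhook_configs:
                critical_endpoint:
                  url: http://127.0.0.1:8080/alerts   # placeholder endpoint
                  send_resolved: true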