StackLight

In the MCP 2019.2.4 maintenance update, Mirantis introduces the StackLight LMA enhancements described in the sections below. To obtain the enhancements, follow the steps described in Apply maintenance updates.


Elasticsearch and Kibana versions update

Updated Elasticsearch and Kibana from version 5.6.12 to 6.8.0.


Prometheus Elasticsearch exporter

Added support for Prometheus Elasticsearch exporter that periodically sends configured queries to the Elasticsearch cluster and exposes the results as Prometheus metrics that you can view in the Prometheus web UI.
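To illustrate what exposing a query result as a Prometheus metric looks like, the following minimal sketch renders one sample in the Prometheus text exposition format. The metric name elasticsearch_query_hits, the query label, and the sample value are hypothetical and are not taken from the exporter. The actual metric names depend on the exporter configuration.

```python
def render_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format.

    Label values are not escaped here, so this sketch only suits
    values without quotes, backslashes, or newlines.
    """
    label_text = ",".join(
        '{}="{}"'.format(key, val) for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_text, value)


# Hypothetical query name and hit count, for illustration only.
line = render_metric("elasticsearch_query_hits", {"query": "error_logs"}, 17)
print(line)
```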


TLS encryption for StackLight

Added the capability to use the Transport Layer Security (TLS) protocol to encrypt the communication between Prometheus and Telegraf as well as between Fluentd and Elasticsearch inside an MCP deployment.

Warning

The functionality does not cover encryption of the traffic between HAProxy and Elasticsearch.


VM state indicator

Implemented the openstack_nova_instance_status and libvirt_domain_info_state metrics to provide an overview of the VM status from the OpenStack perspective and the VM state from the libvirt perspective. To view the metrics, use the Prometheus web UI.
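When reading the libvirt state in the Prometheus web UI, a numeric value may need decoding. The sketch below maps the domain state codes defined by the libvirt API to short descriptions. It assumes that the libvirt_domain_info_state value carries the libvirt state code, which this document does not confirm. The helper name describe_domain_state is hypothetical.

```python
# Domain state codes as defined by the libvirt API (virDomainState).
LIBVIRT_DOMAIN_STATES = {
    0: "no state",
    1: "running",
    2: "blocked",
    3: "paused",
    4: "shutting down",
    5: "shut off",
    6: "crashed",
    7: "suspended by guest power management",
}


def describe_domain_state(code):
    """Return a description for a state code.

    Prometheus sample values are floats, so the code is converted
    to an integer first. Assumption: the metric value is the libvirt
    state code.
    """
    return LIBVIRT_DOMAIN_STATES.get(int(code), "unknown")


print(describe_domain_state(1.0))
```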


Docker services logging

Added the capability for Fluentd to parse the Docker logs and send them to Elasticsearch. Now, you can view the Docker services logs in the Kibana web UI.


KPI measurements

Implemented the KPI Downtime and KPI Provisioning Grafana dashboards as well as the OVSInstanceArpingCheckDown and OpencontrailInstancePingCheckDownKey alerts to provide an overview of the infrastructure stability based on the following Key Performance Indicator (KPI) measurements:

Provisioning KPI

Provides the percentage of instance provisioning failures from the perspective of OpenStack notifications. StackLight LMA tracks the compute.instance.create.start, compute.instance.create.end, and compute.instance.create.error Nova notifications and calculates the KPI on a daily basis. The measurements reset at midnight.

Downtime KPI

Provides the percentage of downtime check failures. Depending on the MCP cluster configuration, the downtime KPI includes the following measurements:

  • The states of instances from the OpenStack perspective. In this case, a check is considered failed if the instance state is ERROR.

  • The instance network checks from the OVS or OpenContrail perspective:

    • For OVS, StackLight LMA performs Address Resolution Protocol (ARP) pings of the DHCP-assigned IP addresses of the OpenStack instances. The check is considered failed if none of the DHCP-assigned IP addresses of the instance respond to ARP pings for 10 minutes.

    • For OpenContrail, StackLight LMA pings the link-local IP addresses of the OpenStack instances. The check is considered failed if none of the link-local IP addresses of the instance respond to pings for 10 minutes.
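The two KPI measurements above can be sketched as follows. This is an illustration under stated assumptions, not the StackLight LMA implementation. The function names are hypothetical, the provisioning formula assumes that failures are divided by the number of finished attempts (end plus error notifications), and the network check assumes that each IP address has a known timestamp of its last reply.

```python
CHECK_WINDOW_SECONDS = 10 * 60  # the 10-minute window from the text


def provisioning_failure_percent(event_types):
    """Percentage of failed provisioning attempts for one day.

    Assumption: the denominator is the number of finished attempts,
    that is, create.end plus create.error notifications.
    """
    ended = sum(1 for e in event_types if e == "compute.instance.create.end")
    failed = sum(1 for e in event_types if e == "compute.instance.create.error")
    finished = ended + failed
    if finished == 0:
        return 0.0
    return 100.0 * failed / finished


def network_check_failed(last_reply_by_ip, now):
    """Return True if no IP of the instance replied within the window.

    last_reply_by_ip maps an IP address to the time (in seconds) of
    its last reply. An instance without IPs is treated as not failed,
    which is an assumption of this sketch.
    """
    if not last_reply_by_ip:
        return False
    return all(
        now - last_reply >= CHECK_WINDOW_SECONDS
        for last_reply in last_reply_by_ip.values()
    )


# Sample data, for illustration only.
events = (
    ["compute.instance.create.start"] * 4
    + ["compute.instance.create.end"] * 3
    + ["compute.instance.create.error"]
)
print(provisioning_failure_percent(events))
print(network_check_failed({"10.0.0.5": 0, "10.0.0.6": 100}, now=700))
```

A single reply from any address inside the window is enough for the check to pass, which mirrors the rule that all addresses must be silent for the check to fail.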


Alerts optimization

Enhanced the StackLight LMA alerts to provide more optimized infrastructure monitoring.


CADF notifications handled by Fluentd

Added the capability for Fluentd to handle the OpenStack Cloud Auditing Data Federation (CADF) notifications instead of Heka. Deprecated the Heka service.

If required, you can configure Fluentd running on the RabbitMQ nodes to forward the CADF events to specific external security information and event management (SIEM) systems. For details, see MCP Operations Guide: Enable sending CADF events to external SIEM systems.

To enable CADF notifications handling by Fluentd and remove Heka:

  1. On the cluster level of the Reclass model:

    1. In openstack/message_queue.yml, add the following class:

      - system.fluentd.label.notifications
      
    2. In stacklight/client.yml, remove the following class:

      - system.docker.swarm.stack.monitoring.remote_collector
      
    3. In stacklight/server.yml, remove the Heka classes:

      - system.heka.remote_collector.container
      - system.heka.remote_collector.input.amqp
      - system.heka.remote_collector.output.elasticsearch
      - system.heka.remote_collector.output.telegraf
      
  2. From the Salt Master node:

    1. Update the Fluentd configuration:

      salt -C "I@fluentd:agent" state.sls fluentd
      
    2. Apply the changes:

      salt -C "I@docker:swarm:role:master and I@prometheus:server" state.sls docker.client
      
    3. Remove the Docker service with Heka:

      salt -C "I@docker:swarm:role:master and I@prometheus:server" cmd.run 'docker service rm monitoring_remote_collector'
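The class changes from step 1 of the procedure above can be sketched as plain list operations. This only illustrates which entries are added and removed. It does not read or write the Reclass model files, the helper names are hypothetical, and the system.example.placeholder entry stands in for whatever other classes a real model contains.

```python
# Heka classes that the procedure removes from stacklight/server.yml.
HEKA_CLASSES = (
    "system.heka.remote_collector.container",
    "system.heka.remote_collector.input.amqp",
    "system.heka.remote_collector.output.elasticsearch",
    "system.heka.remote_collector.output.telegraf",
)
# Class that the procedure adds to openstack/message_queue.yml.
FLUENTD_NOTIFICATIONS_CLASS = "system.fluentd.label.notifications"


def remove_classes(classes, unwanted):
    """Return the class list without the unwanted entries, order kept."""
    return [name for name in classes if name not in unwanted]


def add_class(classes, new_class):
    """Return the class list with new_class appended if it is missing."""
    if new_class in classes:
        return list(classes)
    return list(classes) + [new_class]


# Hypothetical starting lists, for illustration only.
server_classes = ["system.example.placeholder"] + list(HEKA_CLASSES)
message_queue_classes = ["system.example.placeholder"]

print(remove_classes(server_classes, HEKA_CLASSES))
print(add_class(message_queue_classes, FLUENTD_NOTIFICATIONS_CLASS))
```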