In the MCP 2019.2.4 maintenance update, Mirantis introduces the following enhancements for StackLight LMA:
To obtain the enhancements, follow the steps described in Apply maintenance updates.
Updated Elasticsearch and Kibana from version 5.6.12 to 6.8.0.
Added support for Prometheus Elasticsearch exporter that periodically sends configured queries to the Elasticsearch cluster and exposes the results as Prometheus metrics that you can view in the Prometheus web UI.
Added the capability to encrypt the communication between Prometheus and Telegraf as well as Fluentd and Elasticsearch inside an MCP deployment over the Transport Layer Security (TLS) protocol.
Warning
The functionality does not cover encryption of the traffic between HAProxy and Elasticsearch.
Implemented the openstack_nova_instance_status and libvirt_domain_info_state metrics to provide an overview of a VM status from the OpenStack perspective and state from the libvirt perspective. To view the metrics, use the Prometheus web UI.
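Besides the web UI, the same metrics can be read through the Prometheus HTTP API. The following is a minimal sketch, assuming a hypothetical endpoint address in PROMETHEUS_URL; the helper name prom_query_url is illustrative only and is not part of StackLight LMA.

```shell
# Hypothetical Prometheus endpoint; override it for your deployment.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://prometheus.example.local:9090}"

# Print the instant-query URL of the Prometheus HTTP API for one metric name.
prom_query_url() {
  printf '%s/api/v1/query?query=%s\n' "$PROMETHEUS_URL" "$1"
}

for metric in openstack_nova_instance_status libvirt_domain_info_state; do
  prom_query_url "$metric"
  # With network access to Prometheus, fetch the current samples instead:
  # curl -s "$(prom_query_url "$metric")"
done
```

The loop only prints the URLs, so it is safe to run on a host without access to the monitoring network.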
Added the capability for Fluentd to parse the Docker logs and send them to Elasticsearch. Now, you can view the Docker services logs in the Kibana web UI.
Implemented the KPI Downtime and KPI Provisioning Grafana dashboards as well as the OVSInstanceArpingCheckDown and OpencontrailInstancePingCheckDownKey alerts to provide an overview of the infrastructure stability based on the following Key Performance Indicator (KPI) measurements:
KPI Provisioning
Provides the percentage of instance provisioning failures from the perspective of OpenStack notifications by tracking the compute.instance.create.start, compute.instance.create.end, and compute.instance.create.error Nova notifications and calculating the KPI on a daily basis. The measurements reset at midnight.
KPI Downtime
Provides the percentage of downtime check failures. Depending on the MCP cluster configuration, the downtime KPI includes the following measurements:
The states of instances from the OpenStack perspective. In this case, a check is considered as failed if the instance state is ERROR.
The instance network checks from the OVS or OpenContrail perspective:
For OVS, StackLight LMA performs Address Resolution Protocol (ARP) pings of the DHCP-assigned IP addresses of the OpenStack instances. The check is considered as failed if none of the DHCP-assigned IP addresses of the instance responds to ARP pings for 10 minutes.
For OpenContrail, StackLight LMA pings the link-local IP addresses of the OpenStack instances. The check is considered as failed if none of the link-local IP addresses of the instance responds to pings for 10 minutes.
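The two KPI rules above can be illustrated with a small worked example. This is only a sketch under stated assumptions: the failure percentage is assumed to be errors divided by finished attempts (create.end plus create.error), which may differ from the exact StackLight LMA calculation, and the function names provisioning_failure_pct and downtime_check are hypothetical.

```shell
# Assumed formula: errors / (ends + errors) * 100 over the notifications
# counted since midnight. The exact StackLight calculation may differ.
provisioning_failure_pct() {
  awk -v ok="$1" -v err="$2" 'BEGIN {
    total = ok + err
    if (total == 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * err / total
  }'
}

# Arguments: seconds since the last reply from each IP of one instance.
# The check fails only if every IP has been silent for 10 minutes or more.
downtime_check() {
  for silent in "$@"; do
    if [ "$silent" -lt 600 ]; then
      echo "ok"
      return 0
    fi
  done
  echo "failed"
}

provisioning_failure_pct 47 3   # 47 create.end and 3 create.error events
downtime_check 720 45           # one of two IPs replied 45 seconds ago
downtime_check 720 610          # both IPs silent for more than 10 minutes
```

The second call reports ok because a single responding address is enough to pass the check, which matches the rule that all addresses of the instance must be silent before the check is considered as failed.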
Enhanced the StackLight LMA alerts to optimize infrastructure monitoring.
Reconsidered the severities of the RabbitMQ* alerts and adjusted the Alertmanager* and SystemMemory* alerts.
Restructured and enhanced the alerts documentation to provide alerts customization capabilities and troubleshooting recommendations, as well as list the alerts that require post-deployment tuning according to the deployment configuration.
Added the following alerts:
Removed the inefficient ContrailFLows*, NovaHypervisor*, NovaAggregate*, NovaTotalVCPUs*, NovaTotalMemory*, NovaTotalDisk*, MemcachedServiceRespawn, MemcachedItemsNoneMinor, SystemSwap*, PrometheusTargetSamples*, and PrometheusDataIngestionWarning alerts.
Note
These alerts will be removed automatically when updating to MCP 2019.2.4. However, if you have modified any of these alerts, you must remove them manually as described in MCP Operations Guide: Manage alerts.
Added the capability for Fluentd to handle the OpenStack Cloud Auditing Data Federation (CADF) notifications instead of Heka. Deprecated the Heka service.
If required, you can configure Fluentd running on the RabbitMQ nodes to forward the Cloud Auditing Data Federation (CADF) events to specific external security information and event management (SIEM) systems. For details, see MCP Operations Guide: Enable sending CADF events to external SIEM systems.
To enable CADF notifications handling by Fluentd and remove Heka:
On the cluster level of the Reclass model:
In openstack/message_queue.yml, add the following class:
- system.fluentd.label.notifications
In stacklight/client.yml, remove the following class:
- system.docker.swarm.stack.monitoring.remote_collector
In stacklight/server.yml, remove the Heka classes:
- system.heka.remote_collector.container
- system.heka.remote_collector.input.amqp
- system.heka.remote_collector.output.elasticsearch
- system.heka.remote_collector.output.telegraf
From the Salt Master node:
Update the Fluentd configuration:
salt -C "I@fluentd:agent" state.sls fluentd
Apply the changes:
salt -C "I@docker:swarm:role:master and I@prometheus:server" state.sls docker.client
Remove the Docker service with Heka:
salt -C "I@docker:swarm:role:master and I@prometheus:server" cmd.run 'docker service rm monitoring_remote_collector'
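After the procedure, you may want to confirm that no Heka references remain. The following is a hedged sketch, not part of the official procedure: find_heka_leftovers is a hypothetical helper, and the Reclass model path in the comment is an assumption that must be adjusted to your cluster.

```shell
# Print remaining references to the Heka classes or to the remote_collector
# stack under a directory. Prints nothing if the model is clean.
find_heka_leftovers() {
  grep -rnE 'system\.heka\.remote_collector|system\.docker\.swarm\.stack\.monitoring\.remote_collector' "$1"
}

# Assumed model location; replace <cluster_name> with your own value:
# find_heka_leftovers /srv/salt/reclass/classes/cluster/<cluster_name>

# List the remaining Docker services; monitoring_remote_collector should be absent:
# salt -C "I@docker:swarm:role:master and I@prometheus:server" cmd.run 'docker service ls'
```

The commands that touch the live environment are left commented out so that the helper can be reviewed and adapted before it is run on the Salt Master node.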