Heat

This section describes the alerts for Heat.


HeatApiOutage

Removed since the 2019.2.11 maintenance update

Severity

Critical

Summary

Heat API is not accessible for all available Heat endpoints in the OpenStack service catalog.

Raise condition

max(openstack_api_check_status{name=~"heat.*"}) == 0

Description

Raises when the checks against all available internal Heat endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 and 300 for Heat and 200, 300, and 400 for Heat CFN. For a list of all available endpoints, run openstack endpoint list.

Troubleshooting

Verify the availability of internal Heat endpoints (URLs) from the output of openstack endpoint list.
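For reference, a minimal manual check may look as follows. The service type filters and the placeholder URL are assumptions; use the internal URLs that openstack endpoint list returns in your environment:

# List the internal Heat and Heat CFN endpoints
openstack endpoint list --service orchestration --interface internal
openstack endpoint list --service cloudformation --interface internal

# Reproduce the Telegraf check; expect HTTP 200 or 300 for heat
# and 200, 300, or 400 for heat-cfn
curl -o /dev/null -s -w "%{http_code}\n" <internal_heat_endpoint_url>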

Tuning

Not required

HeatApiDown

Removed since the 2019.2.11 maintenance update

Severity

Major

Summary

Heat API is not accessible for the {{ $labels.name }} endpoint.

Raise condition

openstack_api_check_status{name=~"heat.*"} == 0

Description

Raises when the check against one of the available internal Heat endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 and 300 for Heat and 200, 300, and 400 for Heat CFN. For a list of all available endpoints, run openstack endpoint list.

Troubleshooting

Verify the availability of internal Heat endpoints (URLs) from the output of openstack endpoint list.

Tuning

Not required

HeatApiEndpointDown

Severity

Minor

Summary

The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name=~"heat.*-api"} == 0

Description

Raises when the check against a Heat API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the hostname of the affected node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
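As a sketch, you can reproduce the check manually on the affected node. The grep invocation and the placeholder URL are illustrative; take the actual URL and the expected status codes from the configuration file:

# Inspect the Heat URLs and expected status codes checked by Telegraf
grep -i -A 5 heat /etc/telegraf/telegraf.d/input-http_response.conf

# Send the same request manually and compare the returned status code
curl -o /dev/null -s -w "%{http_code}\n" <configured_heat_api_url>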

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

HeatApiEndpointsDownMajor

Severity

Major

Summary

{{ $value }} {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.

Raise condition

count by(name) (http_response_status{name=~"heat.*-api"} == 0) >= count by(name) (http_response_status{name=~"heat.*-api"}) * 0.5

Description

Raises when the check against a Heat API endpoint does not pass on 50% or more of the ctl nodes, typically meaning that the service endpoint is down or unreachable due to connectivity issues. For example, in an environment with three ctl nodes, the alert fires when the check fails on two or more of them. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the HeatApiEndpointDown alerts for the host names of the affected nodes.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

HeatApiEndpointsOutage

Severity

Critical

Summary

All available {{ $labels.name }} endpoints are not accessible for 2 minutes.

Raise condition

count by(name) (http_response_status{name=~"heat.*-api"} == 0) == count by(name) (http_response_status{name=~"heat.*-api"})

Description

Raises when the check against a Heat API endpoint does not pass on all OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the HeatApiEndpointDown alerts for the host names of the affected nodes.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

HeatErrorLogsTooHigh

Severity

Warning

Summary

The average per-second rate of errors in Heat logs on the {{ $labels.host }} node is {{ $value }} as measured over the last 5 minutes.

Raise condition

sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="heat"}[5m])) > 0.2

Description

Raises when the average per-second rate of error, fatal, or emergency messages in Heat logs on the node is greater than 0.2. Fluentd forwards all logs from Heat to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the affected node.

Troubleshooting

Inspect the log files in the /var/log/heat/ directory on the corresponding node.
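For example, to surface recent high-severity messages (a sketch; the exact log file names, such as heat-api.log and heat-engine.log, depend on the deployment, and OpenStack services typically log such messages as ERROR or CRITICAL):

# Show the 50 most recent error-level messages across the Heat logs
grep -hiE "error|critical" /var/log/heat/*.log | tail -n 50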

Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Heat error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs over a longer period of time and define the best threshold.
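You can also evaluate the raise condition from the command line through the Prometheus HTTP API, for example (a sketch; the host placeholder and port 9090 are assumptions for your environment):

curl -s -G 'http://<prometheus_host>:9090/api/v1/query' \
  --data-urlencode 'query=sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="heat"}[5m]))'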

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            HeatErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="heat", level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
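
As an alternative to the web UI, you can verify that the customized rule has been loaded through the Prometheus rules API (a sketch; the host placeholder and port 9090 are assumptions for your environment):

# Confirm that the customized alert rule is present in the loaded rule set
curl -s http://<prometheus_host>:9090/api/v1/rules | grep -o HeatErrorLogsTooHigh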