Heat
This section describes the alerts for Heat.
HeatApiOutage
Removed since the 2019.2.11 maintenance update
Severity
  Critical
Summary
  Heat API is not accessible for all available Heat endpoints in the
  OpenStack service catalog.
Raise condition
  max(openstack_api_check_status{name=~"heat.*"}) == 0
Description
  Raises when the checks against all available internal Heat endpoints in
  the OpenStack service catalog do not pass. Telegraf sends HTTP requests
  to the URLs from the OpenStack service catalog and compares the expected
  and actual HTTP response codes. The expected response codes are 200 and
  300 for Heat and 200, 300, and 400 for Heat CFN. For a list of all
  available endpoints, run openstack endpoint list.
Troubleshooting
  Verify the availability of internal Heat endpoints (URLs) from the
  output of openstack endpoint list.
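  As a minimal sketch, assuming admin credentials are sourced and using a
  placeholder controller hostname with the default Heat API port:

    # List the internal Heat endpoints registered in the service catalog
    openstack endpoint list --service heat --interface internal
    # Probe one of the listed URLs; 200 and 300 are expected for Heat
    curl -si http://ctl01:8004/ | head -n 1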
Tuning
  Not required
HeatApiDown
Removed since the 2019.2.11 maintenance update
Severity
  Major
Summary
  Heat API is not accessible for the {{ $labels.name }} endpoint.
Raise condition
  openstack_api_check_status{name=~"heat.*"} == 0
Description
  Raises when the checks against one of the available internal Heat
  endpoints in the OpenStack service catalog do not pass. Telegraf sends
  HTTP requests to the URLs from the OpenStack service catalog and
  compares the expected and actual HTTP response codes. The expected
  response codes are 200 and 300 for Heat and 200, 300, and 400 for
  Heat CFN. For a list of all available endpoints, run
  openstack endpoint list.
Troubleshooting
  Verify the availability of internal Heat endpoints (URLs) from the
  output of openstack endpoint list.
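  To identify the failing endpoint (the name label of the alert), the
  raise condition can be queried directly over the Prometheus HTTP API.
  The Prometheus address below is a placeholder for the particular
  environment:

    curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
      --data-urlencode 'query=openstack_api_check_status{name=~"heat.*"} == 0'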
Tuning
  Not required
HeatApiEndpointDown
Severity
  Minor
Summary
  The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not
  accessible for 2 minutes.
Raise condition
  http_response_status{name=~"heat.*-api"} == 0
Description
  Raises when the check against a Heat API endpoint does not pass,
  typically meaning that the service endpoint is down or unreachable due
  to connectivity issues. The host label in the raised alert contains
  the hostname of the affected node. Telegraf sends a request to the URL
  configured in /etc/telegraf/telegraf.d/input-http_response.conf on
  the corresponding node and compares the expected and actual HTTP
  response codes from the configuration file.
Troubleshooting |
- Inspect the Telegraf logs using
journalctl -u telegraf or in
/var/log/telegraf .
- Verify the configured URL availability using
curl .
|
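  As a minimal sketch, on the affected node (the hostname and URL are
  placeholders; take the actual URL from the Telegraf configuration
  file):

    # Recent Telegraf messages related to the Heat checks
    journalctl -u telegraf --since "1 hour ago" | grep -i heat
    # The checked Heat URL and the expected response code
    grep -A 5 heat /etc/telegraf/telegraf.d/input-http_response.conf
    # Probe the configured URL
    curl -si http://ctl01:8004/ | head -n 1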
Tuning
  Not required
HeatApiEndpointsDownMajor
Severity
  Major
Summary
  {{ $value }} {{ $labels.name }} endpoints (>= 50%) are not accessible
  for 2 minutes.
Raise condition
  count by(name) (http_response_status{name=~"heat.*-api"} == 0) >=
  count by(name) (http_response_status{name=~"heat.*-api"}) * 0.5
Description
  Raises when the check against a Heat API endpoint does not pass on 50%
  or more of the ctl nodes, typically meaning that the service endpoint
  is down or unreachable due to connectivity issues. Telegraf sends a
  request to the URL configured in
  /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding
  node and compares the expected and actual HTTP response codes from the
  configuration file.
Troubleshooting
  - Inspect the HeatApiEndpointDown alerts for the host names of the
    affected nodes.
  - Inspect the Telegraf logs using journalctl -u telegraf or in
    /var/log/telegraf.
  - Verify the configured URL availability using curl.
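  To list the failing checks together with the affected host labels, the
  series behind the raise condition can be queried over the Prometheus
  HTTP API (the Prometheus address is a placeholder):

    curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
      --data-urlencode 'query=http_response_status{name=~"heat.*-api"} == 0'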
Tuning
  Not required
HeatApiEndpointsOutage
Severity
  Critical
Summary
  All available {{ $labels.name }} endpoints are not accessible for
  2 minutes.
Raise condition
  count by(name) (http_response_status{name=~"heat.*-api"} == 0) ==
  count by(name) (http_response_status{name=~"heat.*-api"})
Description
  Raises when the check against a Heat API endpoint does not pass on all
  OpenStack controller nodes, typically indicating that the service
  endpoint is down or unreachable due to connectivity issues. Telegraf
  sends a request to the URL configured in
  /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding
  node and compares the expected and actual HTTP response codes from the
  configuration file.
Troubleshooting
  - Inspect the HeatApiEndpointDown alerts for the host names of the
    affected nodes.
  - Inspect the Telegraf logs using journalctl -u telegraf or in
    /var/log/telegraf.
  - Verify the configured URL availability using curl.
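  During a full outage, it can also help to check the heat-api service on
  all OpenStack controller nodes at once. A minimal sketch from the Salt
  Master node, assuming the controllers are targeted by the heat:server
  pillar and the service is managed by systemd:

    salt -C 'I@heat:server' cmd.run 'systemctl --no-pager status heat-api'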
Tuning
  Not required
HeatErrorLogsTooHigh
Severity
  Warning
Summary
  The average per-second rate of errors in Heat logs on the
  {{ $labels.host }} node is {{ $value }} as measured over the last
  5 minutes.
Raise condition
  sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="heat"}[5m])) > 0.2
Description
  Raises when the average per-second rate of error, fatal, or emergency
  messages in Heat logs on the node is more than 0.2 per second. Fluentd
  forwards all logs from Heat to Elasticsearch and counts the number of
  log messages per severity. The host label in the raised alert
  contains the affected node.
Troubleshooting
  Inspect the log files in the /var/log/heat/ directory on the
  corresponding node.
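  For example, on the affected node (log file names can differ per
  deployment):

    grep -E 'ERROR|CRITICAL' /var/log/heat/*.log | tail -n 50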
Tuning
  Typically, you should not change the default value. If the alert is
  constantly firing, inspect the Heat error logs in the Kibana web UI.
  However, you can adjust the threshold to an acceptable error rate for
  a particular environment. In the Prometheus web UI, use the raise
  condition query to view the appearance rate of a particular message
  type in logs for a longer period of time and define the best
  threshold.
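  As a sketch, the per-node peak of the error rate over the last day can
  also be retrieved through the Prometheus HTTP API using a subquery,
  which requires Prometheus 2.7 or newer. The Prometheus address is a
  placeholder:

    curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
      --data-urlencode 'query=max_over_time((sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="heat"}[5m])))[24h:5m])'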
  For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for
     all alert customizations. Skip this step to use an existing defined
     file.

     1. Create a file for alert customizations:

          touch cluster/<cluster_name>/stacklight/custom/alerts.yml

     2. Define the new file in
        cluster/<cluster_name>/stacklight/server.yml:

          classes:
          - cluster.<cluster_name>.stacklight.custom.alerts
          ...

  2. In the defined alert customizations file, modify the alert threshold
     by overriding the if parameter:

       parameters:
         prometheus:
           server:
             alert:
               HeatErrorLogsTooHigh:
                 if: >-
                   sum(rate(log_messages{service="heat", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4

  3. From the Salt Master node, apply the changes:

       salt 'I@prometheus:server' state.sls prometheus.server

  4. Verify the updated alert definition in the Prometheus web UI.
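  As an alternative to the web UI, the active rule can also be checked
  over the Prometheus HTTP API (the Prometheus address is a placeholder):

    # Look for the HeatErrorLogsTooHigh entry and its query in the output
    curl -s 'http://<prometheus_host>:9090/api/v1/rules'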