Glance
This section describes the alerts for Glance.
GlanceApiOutage
Removed since the 2019.2.11 maintenance update
Severity | Critical
Summary | Glance API is not accessible for the Glance endpoint in the OpenStack
  service catalog.
Raise condition | openstack_api_check_status{name="glance"} == 0
Description | Raises when the checks against all available internal Glance endpoints
  in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the
  URLs from the OpenStack service catalog and compares the expected and actual HTTP
  response codes. The expected response codes for Glance are 200 and 300.
Troubleshooting | Obtain the list of available endpoints using openstack endpoint list
  and verify the availability of the internal Glance endpoints (URLs) from the list
  (see the sketch after this table).
Tuning | Not required
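The following is a minimal verification sketch for the Troubleshooting row above. It
assumes the OpenStack client is configured with admin credentials; the endpoint URL
and port 9292 are placeholders, so substitute the actual internal Glance URL taken
from the catalog output.
  # List the internal endpoint of the Image (Glance) service.
  openstack endpoint list --service image --interface internal
  # Query the endpoint directly; Glance is expected to return 200 or 300.
  curl -sS -o /dev/null -w '%{http_code}\n' http://<internal_vip>:9292/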
GlareApiOutage
Removed since the 2019.2.11 maintenance update
Severity | Critical
Summary | Glare API is not accessible for the Glare endpoint in the OpenStack
  service catalog.
Raise condition | openstack_api_check_status{name="glare"} == 0
Description | Raises when the checks against all available internal Glare endpoints
  in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the
  URLs from the OpenStack service catalog and compares the expected and actual HTTP
  response codes. The expected response codes for Glare are 200 and 300.
Troubleshooting | Obtain the list of available endpoints using openstack endpoint list
  and verify the availability of the internal Glare endpoints (URLs) from the list.
Tuning | Not required
GlanceApiEndpointDown
Severity | Minor
Summary | The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not
  accessible for 2 minutes.
Raise condition | http_response_status{name=~"glance.*"} == 0
Description | Raises when the check against the Glance API endpoint does not pass,
  typically meaning that the service endpoint is down or unreachable due to
  connectivity issues. The host label in the raised alert contains the host name of
  the affected node. Telegraf sends an HTTP request to the URL configured in
  /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and
  compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting |
  - Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  - Verify the configured URL availability using curl (see the sketch after this
    table).
Tuning | Not required
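A minimal sketch of the Troubleshooting steps above, to be run on the node reported
in the host label. The grep pattern and the example URL are assumptions; take the
actual URL and expected response code from the Telegraf configuration file.
  # Check the recent Telegraf messages related to the HTTP response checks.
  journalctl -u telegraf --since "1 hour ago" | grep -i glance
  # Find the Glance URL that Telegraf polls on this node.
  grep -i -A 5 glance /etc/telegraf/telegraf.d/input-http_response.conf
  # Query the URL directly and compare the returned code with the expected one.
  curl -sS -o /dev/null -w '%{http_code}\n' http://<node_address>:9292/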
GlanceApiEndpointsDownMajor
Severity | Major
Summary | More than 50% of {{ $labels.name }} endpoints are not accessible for
  2 minutes.
Raise condition | count by(name) (http_response_status{name=~"glance.*"} == 0) >= count by(name) (http_response_status{name=~"glance.*"}) * 0.5
Description | Raises when the check against the Glance API endpoint does not pass on
  more than 50% of the ctl nodes, typically meaning that the service endpoint is down
  or unreachable due to connectivity issues. For details on the affected nodes, see
  the host label in the GlanceApiEndpointDown alerts. Telegraf sends an HTTP request
  to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the
  corresponding node and compares the expected and actual HTTP response codes from
  the configuration file.
Troubleshooting |
  - Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  - Verify the configured URL availability using curl.
Tuning | Not required
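To identify the endpoints that currently fail the check behind
GlanceApiEndpointsDownMajor, you can run the raise condition subquery against the
Prometheus HTTP API instead of the web UI. This is a sketch only; the Prometheus host
and port 9090 are assumptions that depend on the deployment.
  # Return the glance http_response_status series that are currently failing (== 0),
  # including the host label of each affected node.
  curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=http_response_status{name=~"glance.*"} == 0'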
GlanceApiEndpointsOutage
Severity | Critical
Summary | All available {{ $labels.name }} endpoints are not accessible for
  2 minutes.
Raise condition | count by(name) (http_response_status{name=~"glance.*"} == 0) == count by(name) (http_response_status{name=~"glance.*"})
Description | Raises when the check against the Glance API endpoint does not pass on
  all controller nodes, typically meaning that the service endpoint is down or
  unreachable due to connectivity issues. For details on the affected nodes, see the
  host label in the GlanceApiEndpointDown alerts. Telegraf sends an HTTP request to
  the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the
  corresponding node and compares the expected and actual HTTP response codes from
  the configuration file.
Troubleshooting |
  - Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  - Verify the configured URL availability using curl.
Tuning | Not required
GlanceErrorLogsTooHigh
Severity | Warning
Summary | The average per-second rate of errors in Glance logs on the
  {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition | sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="glance"}[5m])) > 0.2
Description | Raises when the average per-second rate of error, fatal, or emergency
  messages in Glance logs on the node is more than 0.2 messages per second. The host
  label in the raised alert contains the affected node. Fluentd forwards all logs
  from Glance to Elasticsearch and counts the number of log messages by severity.
Troubleshooting | Inspect the log files in /var/log/glance/ on the corresponding node
  (see the sketch at the end of this alert).
Tuning |
  Typically, you should not change the default value. If the alert is constantly
  firing, inspect the Glance error logs in Kibana and adjust the threshold to an
  acceptable error rate for a particular environment. In the Prometheus Web UI, use
  the raise condition query to view the appearance rate of a particular message type
  in logs for a longer period of time and define the best threshold.
  For example, to change the threshold to 0.4:
  1. On the cluster level of the Reclass model, create a common file for all alert
     customizations. Skip this step to use an existing defined file.
     - Create a file for alert customizations:
       touch cluster/<cluster_name>/stacklight/custom/alerts.yml
     - Define the new file in cluster/<cluster_name>/stacklight/server.yml:
       classes:
       - cluster.<cluster_name>.stacklight.custom.alerts
       ...
  2. In the defined alert customizations file, modify the alert threshold by
     overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             GlanceErrorLogsTooHigh:
               if: >-
                 sum(rate(log_messages{service="glance", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
  3. From the Salt Master node, apply the changes:
     salt 'I@prometheus:server' state.sls prometheus.server
  4. Verify the updated alert definition in the Prometheus web UI.
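A minimal sketch for the log inspection mentioned in the Troubleshooting row of this
alert, to be run on the affected node. The log file names under /var/log/glance/ and
the matched severity strings depend on the enabled Glance services and their logging
configuration.
  # Show the most recent error-level messages across the Glance log files.
  grep -iE 'ERROR|CRITICAL' /var/log/glance/*.log | tail -n 50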