Keystone
This section describes the alerts for Keystone.
KeystoneApiOutage
Removed since the 2019.2.11 maintenance update
Severity |
Critical |
Summary |
Keystone API is not accessible for the Keystone endpoint in the
OpenStack service catalog. |
Raise condition |
openstack_api_check_status{name=~"keystone.*"} == 0 |
Description |
Raises when the checks against all available internal Keystone endpoints
in the OpenStack service catalog do not pass. Telegraf sends HTTP
requests to the URLs from the OpenStack service catalog and compares the
expected and actual HTTP response codes. The expected response codes for
Keystone are 200 and 300. For a list of all available endpoints,
run openstack endpoint list. |
Troubleshooting |
Verify the availability of the internal Keystone endpoints (URLs) from the
output of openstack endpoint list. |
Tuning |
Not required |
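The comparison of expected and actual response codes described for this alert can be reproduced by hand. The following is a minimal Python sketch, not the Telegraf implementation: the function names are hypothetical, and the value 1 for a passing check is an assumption (the raise condition only confirms that 0 means a failed check).

```python
import urllib.error
import urllib.request

# Expected response codes for Keystone, as stated in the alert description.
EXPECTED_CODES = frozenset({200, 300})


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for answers it does not treat as success
        # (for example 4xx and 5xx); the status code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_passes(actual, expected=EXPECTED_CODES):
    """Return 1 if the actual code is expected, else 0.

    0 mirrors the failed state used in the raise condition;
    1 for a passing check is an assumption of this sketch.
    """
    return 1 if actual in expected else 0
```

To try it against a real endpoint, pass a URL from the output of openstack endpoint list, for example check_passes(fetch_status("<internal_keystone_url>")).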
KeystoneApiEndpointDown
Severity |
Minor |
Summary |
The {{ $labels.name }} endpoint on the {{ $labels.host }} node
is not accessible for 2 minutes. |
Raise condition |
http_response_status{name=~"keystone.*"} == 0 |
Description |
Raises when the check against the Keystone API endpoint does not pass,
typically meaning that the service endpoint is down or unreachable due
to connectivity issues. Telegraf sends an HTTP request to the URL
configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. The host label in the raised
alert contains the host name of the affected node. |
Troubleshooting |
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the availability of the configured URL using curl.
|
Tuning |
Not required |
KeystoneApiEndpointsDownMajor
Severity |
Major |
Summary |
{{ $value }} {{ $labels.name }} endpoints (>= 50%) are not
accessible for 2 minutes. |
Raise condition |
count by(name) (http_response_status{name=~"keystone.*"} == 0) >=
count by(name) (http_response_status{name=~"keystone.*"}) * 0.5 |
Description |
Raises when the check against a Keystone API endpoint does not pass on
at least 50% of the OpenStack controller (ctl) nodes, typically
indicating that the service endpoint is down or unreachable due to
connectivity issues.
Telegraf sends an HTTP request to the URL configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. |
Troubleshooting |
- Inspect the KeystoneApiEndpointDown alerts for the affected nodes.
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the availability of the configured URL using curl.
|
Tuning |
Not required |
KeystoneApiEndpointsOutage
Severity |
Critical |
Summary |
All available {{ $labels.name }} endpoints are not accessible for 2
minutes. |
Raise condition |
count by(name) (http_response_status{name=~"keystone.*"} == 0) ==
count by(name) (http_response_status{name=~"keystone.*"}) |
Description |
Raises when the check against a Keystone API endpoint does not pass on
all OpenStack controller nodes, typically indicating that the service
endpoint is down or unreachable due to connectivity issues. Telegraf
sends an HTTP request to the URL configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. |
Troubleshooting |
- Inspect the KeystoneApiEndpointDown alerts for the affected nodes.
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the availability of the configured URL using curl.
|
Tuning |
Not required |
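The three endpoint alerts above differ only in how many nodes report a failed check. The following minimal Python sketch applies the same arithmetic as the raise conditions to a set of per-node check results. The function name and the host names are hypothetical, and the value 1 for a passing check is an assumption. In Prometheus, the less severe alerts also match whenever a more severe one does; the sketch returns only the highest matching severity.

```python
def highest_severity(statuses):
    """Return the highest matching severity for per-node check results.

    statuses maps a host name to 0 (check failed, as in the raise
    conditions) or 1 (check passed, an assumption of this sketch).
    Returns "critical", "major", "minor", or None.
    """
    total = len(statuses)
    failed = sum(1 for value in statuses.values() if value == 0)
    if total == 0 or failed == 0:
        return None
    if failed == total:
        # count(failed) == count(all): every endpoint is down.
        return "critical"
    if failed >= total * 0.5:
        # count(failed) >= count(all) * 0.5
        return "major"
    # At least one endpoint is down.
    return "minor"


if __name__ == "__main__":
    # Hypothetical three-node control plane with two failed checks.
    print(highest_severity({"ctl01": 0, "ctl02": 0, "ctl03": 1}))
```

With three controller nodes, one failed check stays at minor because 1 is less than 1.5, while two failed checks reach major.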
KeystoneErrorLogsTooHigh
Severity |
Warning |
Summary |
The average per-second rate of errors in the Keystone logs on the
{{ $labels.host }} node is {{ $value }} (as measured over the
last 5 minutes). |
Raise condition |
- In 2019.2.10 and prior:
  sum without(level)(rate(log_messages{level=~"(?i:(error|emergency|fatal))",
  service="keystone"}[5m])) > 0.2
- In 2019.2.11 and newer:
  sum(rate(log_messages{service=~"keystone|keystone-wsgi",
  level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > {{ log_threshold }}
|
Description |
Raises when the average per-second rate of the error, fatal, or
emergency messages in Keystone logs on the node is more than 0.2 per
second. Fluentd forwards all logs from Keystone to Elasticsearch and
counts the number of log messages per severity. The host label in
the raised alert contains the host name of the affected node. |
Troubleshooting |
Inspect the log files in the /var/log/keystone/ directory on the
corresponding node. |
Tuning |
Typically, you should not change the default value. If the alert is
constantly firing, inspect the Keystone error logs in Kibana. You can
adjust the threshold to an acceptable error rate for a particular
environment. In the Prometheus web UI, use the raise condition query to
view the appearance rate of a particular message type in logs for a
longer period of time and define the best threshold.
For example, to change the threshold to 0.4:
1. On the cluster level of the Reclass model, create a common file for
   all alert customizations. Skip this step to use an existing defined
   file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in
      cluster/<cluster_name>/stacklight/server.yml:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold
   by overriding the if parameter:
   parameters:
     prometheus:
       server:
         alert:
           KeystoneErrorLogsTooHigh:
             if: >-
               sum(rate(log_messages{service="keystone",
               level=~"(?i:(error|emergency|fatal))"}[5m]))
               without (level) > 0.4
3. From the Salt Master node, apply the changes:
   salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
|
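When choosing a threshold, it can help to see what a per-second rate means in terms of raw message counts. The following minimal Python sketch approximates the rate over a 5-minute window from two counter readings. It is a simplification: the Prometheus rate() function also extrapolates to the window boundaries and handles counter resets, and the function names here are hypothetical.

```python
def per_second_rate(first_count, last_count, window_seconds=300):
    """Approximate the average per-second rate between two counter readings."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # A decreasing counter is treated as no increase; this is a
    # simplification and not how Prometheus handles counter resets.
    increase = max(last_count - first_count, 0)
    return increase / window_seconds


def exceeds_threshold(first_count, last_count, threshold=0.2,
                      window_seconds=300):
    """Return True if the approximate rate is strictly above the threshold."""
    return per_second_rate(first_count, last_count, window_seconds) > threshold


if __name__ == "__main__":
    # 90 error messages in 5 minutes is 0.3 per second: above the 0.2
    # threshold, but below a threshold raised to 0.4.
    print(exceeds_threshold(100, 190))
    print(exceeds_threshold(100, 190, threshold=0.4))
```

In other words, a threshold of 0.2 corresponds to more than 60 error messages in 5 minutes, and 0.4 to more than 120.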
KeystoneApiResponseTimeTooHigh
Severity |
Warning |
Summary |
The Keystone API response time for GET and POST requests on the
{{ $labels.host }} node is higher than 3 seconds for 2 minutes. |
Raise condition |
max by(host) (openstack_http_response_times{http_method=~"^(GET|POST)$",
http_status=~"^2..$", quantile="0.9", service="keystone"}) >= 3 |
Description |
Raises when the 90th percentile of the response time for successful
(2xx) GET and POST requests to the Keystone API on a node is 3 seconds
or more. |
Troubleshooting |
Verify the performance of the OpenStack controller node. |
Tuning |
For example, to change the threshold to 5 seconds:
1. On the cluster level of the Reclass model, create a common file for
   all alert customizations. Skip this step to use an existing defined
   file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in
      cluster/<cluster_name>/stacklight/server.yml:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold
   by overriding the if parameter:
   parameters:
     prometheus:
       server:
         alert:
           KeystoneApiResponseTimeTooHigh:
             if: >-
               max by(host) (openstack_http_response_times{
               http_method=~"^(GET|POST)$", http_status=~"^2..$",
               quantile="0.9", service="keystone"}) >= 5
3. From the Salt Master node, apply the changes:
   salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
|
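To judge whether a different threshold would suit an environment, the raise condition can be evaluated offline against exported samples. The following minimal Python sketch mirrors the label filters and the max by(host) aggregation of the query. The function name, the sample layout (one dictionary per series), and the host names are assumptions for illustration only.

```python
import re

# Same label filters as in the raise condition.
METHOD_RE = re.compile(r"^(GET|POST)$")
STATUS_RE = re.compile(r"^2..$")


def slow_hosts(samples, threshold=3.0):
    """Return {host: worst response time} for hosts at or above the threshold.

    Each sample is a dict with the keys host, http_method, http_status,
    quantile, service, and value (seconds). This layout is hypothetical.
    """
    worst = {}
    for sample in samples:
        if sample["service"] != "keystone" or sample["quantile"] != "0.9":
            continue
        if not METHOD_RE.match(sample["http_method"]):
            continue
        if not STATUS_RE.match(sample["http_status"]):
            continue
        host = sample["host"]
        # max by(host): keep the largest value seen for each host.
        worst[host] = max(worst.get(host, sample["value"]), sample["value"])
    return {host: value for host, value in worst.items() if value >= threshold}


if __name__ == "__main__":
    example = [
        {"host": "ctl01", "http_method": "GET", "http_status": "200",
         "quantile": "0.9", "service": "keystone", "value": 3.4},
        {"host": "ctl02", "http_method": "POST", "http_status": "201",
         "quantile": "0.9", "service": "keystone", "value": 1.2},
    ]
    print(slow_hosts(example))
    print(slow_hosts(example, threshold=5.0))
```

Running the same samples through several threshold values shows which nodes would raise the alert at each setting before the definition is changed.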
KeystoneKeysRotationFailure
Available since the 2019.2.11 maintenance update
Severity |
Major |
Summary |
Keystone keys rotation failure. |
Raise condition |
increase(log_messages{service="keystone-keys-rotation",level="ERROR"}[2h]) > 0 |
Description |
Raises when a Keystone user fails to rotate fernet or credential keys
across the OpenStack controller nodes. The host label in the raised
alert contains the host name of the affected node. |
Troubleshooting |
Inspect the /var/log/keystone/keystone-rotate.log log on the
affected node. |
Tuning |
Not required |