Keystone

This section describes the alerts for Keystone.


KeystoneApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Keystone API is not accessible for the Keystone endpoint in the OpenStack service catalog.
Raise condition openstack_api_check_status{name=~"keystone.*"} == 0
Description Raises when the checks against all available internal Keystone endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Keystone are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Keystone endpoints (URLs) from the output of openstack endpoint list.
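
For example, a quick manual check (a sketch; assumes the openstack CLI is configured with admin credentials, and the URL is a placeholder for the internal endpoint from the catalog):

  openstack endpoint list --service identity
  curl -i http://<keystone_internal_endpoint>:5000/v3

A versioned /v3 request typically returns 200, while a request to the unversioned root returns 300, matching the expected response codes above.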
Tuning Not required

KeystoneApiEndpointDown

Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"keystone.*"} == 0
Description Raises when the check against the Keystone API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl, for example as shown below.
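
A minimal check (a sketch; the endpoint URL is a placeholder, use the value configured in input-http_response.conf on the affected node):

  journalctl -u telegraf --since "1 hour ago"
  cat /etc/telegraf/telegraf.d/input-http_response.conf
  curl -s -o /dev/null -w '%{http_code}\n' http://<keystone_endpoint>:5000/v3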
Tuning Not required

KeystoneApiEndpointsDownMajor

Severity Major
Summary {{ $value }} {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"keystone.*"} == 0) >= count by(name) (http_response_status{name=~"keystone.*"}) * 0.5
Description Raises when the check against a Keystone API endpoint does not pass on 50% or more of the ctl nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the KeystoneApiEndpointDown alerts for the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required

KeystoneApiEndpointsOutage

Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count by(name) (http_response_status{name=~"keystone.*"} == 0) == count by(name) (http_response_status{name=~"keystone.*"})
Description Raises when the check against a Keystone API endpoint does not pass on all OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends an HTTP request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the KeystoneApiEndpointDown alerts for the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required

KeystoneErrorLogsTooHigh

Severity Warning
Summary The average per-second rate of errors in the Keystone logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition
  • In 2019.2.10 and prior: sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="keystone"}[5m])) > 0.2
  • In 2019.2.11 and newer: sum(rate(log_messages{service=~"keystone|keystone-wsgi",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
Description Raises when the average per-second rate of the error, fatal, or emergency messages in the Keystone logs on the node is more than 0.2 per second. Fluentd forwards all logs from Keystone to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the log files in the /var/log/keystone/ directory on the corresponding node.
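
For example, to count the recent high-severity messages per log file (a sketch; the paths assume the default Keystone log location):

  grep -ciE 'error|emergency|fatal' /var/log/keystone/*.log
  tail -n 50 /var/log/keystone/keystone.log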
Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Keystone error logs in Kibana. You can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            KeystoneErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="keystone",
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

KeystoneApiResponseTimeTooHigh

Severity Warning
Summary The Keystone API response time for GET and POST requests on the {{ $labels.host }} node is higher than 3 seconds for 2 minutes.
Raise condition max by(host) (openstack_http_response_times{http_method=~"^(GET|POST)$", http_status=~"^2..$", quantile="0.9", service="keystone"}) >= 3
Description Raises when the GET and POST requests to the Keystone API take more than 3 seconds.
Troubleshooting Verify the performance of the OpenStack controller node.
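
For example, to measure the response time directly (a sketch; the URL is a placeholder for the Keystone endpoint from the service catalog, and uptime gives a first look at the node load):

  curl -s -o /dev/null -w 'HTTP %{http_code}, total %{time_total}s\n' http://<keystone_endpoint>:5000/v3
  uptime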
Tuning

For example, to change the threshold to 5 seconds:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            KeystoneApiResponseTimeTooHigh:
              if: >-
                max by(host) (openstack_http_response_times{http_method=~"^(GET|POST)$",
                http_status=~"^2..$", quantile="0.9",
                service="keystone"}) >= 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.


KeystoneKeysRotationFailure

Available since the 2019.2.11 maintenance update

Severity Major
Summary Keystone keys rotation failure.
Raise condition increase(log_messages{service="keystone-keys-rotation",level="ERROR"}[2h]) > 0
Description Raises when the rotation of the fernet or credential keys fails on an OpenStack controller node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the /var/log/keystone/keystone-rotate.log log on the affected node.
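
For example, to inspect the log and retry the rotation manually (a sketch; the rotation user and group may differ per deployment):

  tail -n 100 /var/log/keystone/keystone-rotate.log
  keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone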
Tuning Not required