Cinder
This section describes the alerts for Cinder.
CinderApiOutage
Removed since the 2019.2.11 maintenance update
Severity: Critical
Summary: Cinder API is not accessible for all available Cinder endpoints in
  the OpenStack service catalog.
Raise condition: max(openstack_api_check_status{name=~"cinder.*"}) == 0
Description: Raises when the checks against all available internal Cinder
  endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP
  requests to the URLs from the OpenStack service catalog and compares the
  expected and actual HTTP response codes. The expected response codes for
  Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available
  endpoints, run openstack endpoint list.
Troubleshooting: Verify the availability of internal Cinder endpoints (URLs)
  from the output of openstack endpoint list, as shown in the example after
  this entry.
Tuning: Not required
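For example, to list the internal Cinder endpoints and verify that each URL
returns one of the expected HTTP codes (200 or 300), you can run the
following commands from a node with the OpenStack client credentials sourced.
The service name and endpoint URL are placeholders; substitute the values
from your own catalog:

  openstack endpoint list --service volumev3 --interface internal
  curl -sS -o /dev/null -w '%{http_code}\n' http://<cinder-internal-endpoint>:8776/

A response code of 200 or 300 from the endpoint root matches what the
Telegraf check expects.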
CinderApiDown
Removed since the 2019.2.11 maintenance update
Severity: Major
Summary: Cinder API is not accessible for the {{ $labels.name }} endpoint.
Raise condition: openstack_api_check_status{name=~"cinder.*"} == 0
Description: Raises when the check against one of the available internal
  Cinder endpoints in the OpenStack service catalog does not pass. Telegraf
  sends HTTP requests to the URLs from the OpenStack service catalog and
  compares the expected and actual HTTP response codes. The expected response
  codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all
  available endpoints, run openstack endpoint list.
Troubleshooting: Verify the availability of internal Cinder endpoints (URLs)
  from the output of the openstack endpoint list command. The example after
  this entry shows how to map the failing check to a catalog entry.
Tuning: Not required
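The name label of the raised alert identifies the failing check, for example
cinderv3 (the exact label values depend on the service catalog). To map it
back to a catalog entry and probe the URL manually, you can run something
like the following; the service name is a placeholder taken from the alert:

  openstack endpoint list --service cinderv3 --interface internal
  curl -sS -o /dev/null -w '%{http_code}\n' <internal URL from the previous command>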
CinderApiEndpointDown
Severity: Minor
Summary: The cinder-api endpoint on the {{ $labels.host }} node is not
  accessible for 2 minutes.
Raise condition: http_response_status{name=~"cinder-api"} == 0
Description: Raises when the check against a Cinder API endpoint does not
  pass, typically meaning that the service endpoint is down or unreachable
  due to connectivity issues. The host label in the raised alert contains the
  host name of the affected node. Telegraf sends a request to the URL
  configured in /etc/telegraf/telegraf.d/input-http_response.conf on the
  corresponding node and compares the expected and actual HTTP response codes
  from the configuration file.
Troubleshooting:
  - Inspect the Telegraf logs using journalctl -u telegraf or in
    /var/log/telegraf.
  - Verify the configured URL availability using curl. See the example after
    this entry.
Tuning: Not required
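For example, on the affected node you can read the checked URL from the
Telegraf configuration and probe it manually. The grep pattern and context
size are assumptions about the configuration layout; adjust them to match
your file:

  grep -B2 -A6 'cinder-api' /etc/telegraf/telegraf.d/input-http_response.conf
  curl -sS -o /dev/null -w '%{http_code}\n' <URL from the configuration file>
  journalctl -u telegraf --since '1 hour ago' | grep -i http_response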
CinderApiEndpointDownMajor
Severity: Major
Summary: More than 50% of cinder-api endpoints are not accessible for 2
  minutes.
Raise condition: count(http_response_status{name=~"cinder-api"} == 0) >=
  count(http_response_status{name=~"cinder-api"}) * 0.5
Description: Raises when the check against a Cinder API endpoint does not
  pass on more than 50% of OpenStack controller nodes. For details on the
  affected nodes, see the host label in the CinderApiEndpointDown alerts.
  Telegraf sends a request to the URL configured in
  /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node
  and compares the expected and actual HTTP response codes from the
  configuration file.
Troubleshooting:
  - Inspect the CinderApiEndpointDown alerts for the host names of the
    affected nodes (the example after this entry shows how to check the
    current failure ratio in Prometheus).
  - Inspect the Telegraf logs using journalctl -u telegraf or in
    /var/log/telegraf.
  - Verify the configured URL availability using curl.
Tuning: Not required
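To see how close the deployment is to the 50% threshold, you can evaluate the
failure ratio in the Prometheus web UI with a query derived from the raise
condition, for example:

  count(http_response_status{name=~"cinder-api"} == 0)
    / count(http_response_status{name=~"cinder-api"})

The result is the fraction of cinder-api checks that currently fail; the
alert fires at 0.5 or higher. If no checks fail, the query returns no data.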
CinderApiEndpointsOutage
Severity: Critical
Summary: All available cinder-api endpoints are not accessible for 2 minutes.
Raise condition: count(http_response_status{name=~"cinder-api"} == 0) ==
  count(http_response_status{name=~"cinder-api"})
Description: Raises when the check against a Cinder API endpoint does not
  pass on all OpenStack controller nodes. Telegraf sends a request to the URL
  configured in /etc/telegraf/telegraf.d/input-http_response.conf on the
  corresponding node and compares the expected and actual HTTP response codes
  from the configuration file.
Troubleshooting:
  - Inspect the CinderApiEndpointDown alerts for the host names of the
    affected nodes.
  - Inspect the Telegraf logs using journalctl -u telegraf or in
    /var/log/telegraf.
  - Verify the configured URL availability using curl. The example after this
    entry shows one way to check all controllers at once.
Tuning: Not required
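Because this alert indicates a failure on every controller, it can be faster
to check all of them from the Salt Master node. The Salt target, the unit
name, and the local API port below are assumptions based on a typical MCP
layout; adjust them to your deployment:

  salt -C 'I@cinder:controller' cmd.run 'systemctl status cinder-api'
  salt -C 'I@cinder:controller' cmd.run 'curl -si http://127.0.0.1:8776/ | head -1'

A 200 or 300 status line from the local cinder-api port on each controller
suggests that the API processes are healthy and the problem is more likely in
load balancing or connectivity.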
CinderServiceDown
Severity: Minor
Summary: The {{ $labels.binary }} service on the {{ $labels.hostname }} node
  is down.
Raise condition: openstack_cinder_service_state == 0
Description: Raises when a Cinder service on the OpenStack controller or
  compute node is in the DOWN state. For the list of Cinder services, see
  Cinder Block Storage service overview. The binary and hostname labels
  contain the name of the service that is in the DOWN state and the node that
  hosts the service.
Troubleshooting:
  - Verify the list of Cinder services and their states using
    openstack volume service list (see the example after this entry).
  - Verify the status of the corresponding Cinder service on the affected
    node using systemctl status <binary>.
  - Inspect the logs of the corresponding Cinder service on the affected node
    in the /var/log/cinder/ directory.
  - Verify the Telegraf monitoring_remote_agent service:
    - Verify the status of the monitoring_remote_agent service using
      docker service ls.
    - Inspect the monitoring_remote_agent service logs by running
      docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning: Not required
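For example, if the alert reports the cinder-scheduler binary on a controller
node, the investigation could look as follows; the service and log file names
are placeholders derived from the alert labels:

  openstack volume service list
  systemctl status cinder-scheduler                         # on the affected node
  tail -n 100 /var/log/cinder/cinder-scheduler.log          # on the affected node
  docker service ls --filter name=monitoring_remote_agent   # on a mon node
  docker service logs --tail 100 monitoring_remote_agent    # on a mon node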
CinderServicesDownMinor
Severity: Minor
Summary: More than 30% of {{ $labels.binary }} services are down.
Raise condition: count by(binary) (openstack_cinder_service_state == 0) >=
  on(binary) count by(binary) (openstack_cinder_service_state) * 0.3
Description: Raises when a Cinder service is in the DOWN state on more than
  30% of the ctl or cmp hosts. For the list of services, see Cinder Block
  Storage service overview. Inspect the hostname label in the
  CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting:
  - Verify the list of Cinder services and their states using
    openstack volume service list.
  - Verify the status of the corresponding Cinder service on the affected
    node using systemctl status <binary>.
  - Inspect the logs of the corresponding Cinder service on the affected node
    in the /var/log/cinder/ directory.
  - Verify the Telegraf monitoring_remote_agent service:
    - Verify the status of the monitoring_remote_agent service using
      docker service ls.
    - Inspect the monitoring_remote_agent service logs by running
      docker service logs monitoring_remote_agent on one of the mon nodes.
  The example after this entry shows how to view the per-binary failure ratio
  in Prometheus.
Tuning: Not required
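To see which Cinder binary is approaching or past the 30% threshold, you can
evaluate the fraction of DOWN services per binary in the Prometheus web UI;
the query below is a sketch derived from the raise condition:

  count by(binary) (openstack_cinder_service_state == 0)
    / on(binary) count by(binary) (openstack_cinder_service_state)

The result, per binary, is the fraction of service instances currently in the
DOWN state; this alert fires at 0.3 and CinderServicesDownMajor at 0.6.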
CinderServicesDownMajor
Severity: Major
Summary: More than 60% of {{ $labels.binary }} services are down.
Raise condition: count by(binary) (openstack_cinder_service_state == 0) >=
  on(binary) count by(binary) (openstack_cinder_service_state) * 0.6
Description: Raises when a Cinder service is in the DOWN state on more than
  60% of the ctl or cmp hosts. For the list of services, see Cinder Block
  Storage service overview. Inspect the hostname label in the
  CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting:
  - Verify the list of Cinder services and their states using
    openstack volume service list.
  - Verify the status of the corresponding Cinder service on the affected
    node using systemctl status <binary>.
  - Inspect the logs of the corresponding Cinder service on the affected node
    in the /var/log/cinder/ directory.
  - Verify the Telegraf monitoring_remote_agent service:
    - Verify the status of the monitoring_remote_agent service using
      docker service ls.
    - Inspect the monitoring_remote_agent service logs by running
      docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning: Not required
CinderServiceOutage
Severity: Critical
Summary: All {{ $labels.binary }} services are down.
Raise condition: count by(binary) (openstack_cinder_service_state == 0) ==
  on(binary) count by(binary) (openstack_cinder_service_state)
Description: Raises when a Cinder service is in the DOWN state on all ctl or
  cmp hosts. For the list of services, see Cinder Block Storage service
  overview. Inspect the hostname label in the CinderServiceDown alerts for
  details on the affected services and nodes.
Troubleshooting:
  - Verify the list of Cinder services and their states using
    openstack volume service list.
  - Verify the status of the corresponding Cinder service on the affected
    node using systemctl status <binary>.
  - Inspect the logs of the corresponding Cinder service on the affected node
    in the /var/log/cinder/ directory.
  - Verify the Telegraf monitoring_remote_agent service:
    - Verify the status of the monitoring_remote_agent service using
      docker service ls.
    - Inspect the monitoring_remote_agent service logs by running
      docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning: Not required
CinderVolumeProcessDown
Available starting from the 2019.2.8 maintenance update
Severity: Minor
Summary: A cinder-volume process is down.
Raise condition: procstat_running{process_name="cinder-volume"} == 0
Description: Raises when a cinder-volume process on a node is down. The host
  label in the raised alert contains the affected node.
Troubleshooting:
  - Log in to the corresponding node and verify the process status using
    systemctl status cinder-volume (see the example after this entry).
  - Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning: Not required
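For example, on the affected node (the journal unit name and the log file
name are assumptions about a typical package layout):

  systemctl status cinder-volume
  journalctl -u cinder-volume --since '1 hour ago'
  tail -n 200 /var/log/cinder/cinder-volume.log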
CinderVolumeProcessesDownMinor
Available starting from the 2019.2.8 maintenance update
Severity: Minor
Summary: 30% of cinder-volume processes are down.
Raise condition: count(procstat_running{process_name="cinder-volume"} == 0)
  >= count(procstat_running{process_name="cinder-volume"}) * 0.3
Description: Raises when 30% or more of the cinder-volume processes are in
  the DOWN state, which typically means at least one process.
Troubleshooting:
  - Log in to the corresponding node and verify the process status using
    systemctl status cinder-volume.
  - Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning: Not required
CinderVolumeProcessesDownMajor
Available starting from the 2019.2.8 maintenance update
Severity: Major
Summary: 60% of cinder-volume processes are down.
Raise condition: count(procstat_running{process_name="cinder-volume"} == 0)
  >= count(procstat_running{process_name="cinder-volume"}) * 0.6
Description: Raises when 60% or more of the cinder-volume processes are in
  the DOWN state, which typically means at least two processes.
Troubleshooting:
  - Log in to the corresponding node and verify the process status using
    systemctl status cinder-volume.
  - Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning: Not required
CinderVolumeServiceOutage
Available starting from the 2019.2.8 maintenance update
Severity: Critical
Summary: The cinder-volume service is down.
Raise condition: count(procstat_running{process_name="cinder-volume"} == 0)
  == count(procstat_running{process_name="cinder-volume"})
Description: Raises when all cinder-volume processes are down.
Troubleshooting:
  - Log in to the corresponding node and verify the process status using
    systemctl status cinder-volume (the example after this entry shows how to
    check all nodes at once).
  - Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning: Not required
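Because this alert means that no cinder-volume process is running anywhere,
it can be convenient to check every node that hosts the service from the Salt
Master node. The Salt pillar target below is an assumption based on a typical
MCP Reclass model; adjust it to your deployment:

  salt -C 'I@cinder:volume' cmd.run 'systemctl status cinder-volume'
  salt -C 'I@cinder:volume' cmd.run 'tail -n 50 /var/log/cinder/cinder-volume.log'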
CinderErrorLogsTooHigh
Severity: Warning
Summary: The average per-second rate of errors in Cinder logs on the
  {{ $labels.host }} node is larger than 0.2 messages per second.
Raise condition: sum without(level)
  (rate(log_messages{level=~"(?i:(error|emergency|fatal))",
  service="cinder"}[5m])) > 0.2
Description: Raises when the average per-second rate of error, fatal, or
  emergency messages in Cinder logs on the node is more than 0.2 per second.
  The host label in the raised alert contains the affected node. Fluentd
  forwards all logs from Cinder to Elasticsearch and counts the number of log
  messages per severity.
Troubleshooting:
  - Inspect the log files in the /var/log/cinder/ directory on the
    corresponding node.
  - Inspect Cinder logs in the Kibana web UI.
Tuning:
  Typically, you should not change the default value. However, you can adjust
  the threshold to an acceptable error rate for a particular environment. In
  the Prometheus web UI, use the raise condition query to view the appearance
  rate of a particular message type in logs over a longer period of time and
  define the best threshold (see the example after this entry).
  To change the threshold to 0.4:
  - On the cluster level of the Reclass model, create a common file for all
    alert customizations. Skip this step to use an already existing file.
    Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
    Define the new file in cluster/<cluster_name>/stacklight/server.yml:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
  - In the defined alert customizations file, modify the alert threshold by
    overriding the if parameter:
      parameters:
        prometheus:
          server:
            alert:
              CinderErrorLogsTooHigh:
                if: >-
                  sum(rate(log_messages{service="cinder",
                  level=~"(?i:(error|emergency|fatal))"}[5m]))
                  without (level) > 0.4
  - From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
  - Verify the updated alert definition in the Prometheus web UI.
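For example, to see the highest 5-minute error rate that Cinder logs reached
over the past week, and so pick a threshold above the normal baseline, you
can run a query such as the following in the Prometheus web UI (a sketch; the
[7d:10m] subquery syntax requires a Prometheus version that supports
subqueries):

  max_over_time(sum without (level) (rate(log_messages{service="cinder",
  level=~"(?i:(error|emergency|fatal))"}[5m]))[7d:10m])

If the returned value stays well below 0.2, the default threshold is
adequate; otherwise, set the if parameter slightly above the observed
baseline.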