Cinder

This section describes the alerts for Cinder.


CinderApiOutage

Removed since the 2019.2.11 maintenance update

Severity

Critical

Summary

Cinder API is not accessible for all available Cinder endpoints in the OpenStack service catalog.

Raise condition

max(openstack_api_check_status{name=~"cinder.*"}) == 0

Description

Raises when the checks against all available internal Cinder endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
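The comparison that Telegraf performs can be sketched with a small shell helper. The expected codes (200 and 300) come from the description above; the endpoint URL in the usage example is a placeholder to be replaced with a real URL from openstack endpoint list:

```shell
#!/bin/sh
# Sketch of the check described above: fetch a Cinder endpoint and compare
# the actual HTTP response code against the expected ones.

# Pure helper: classify an HTTP response code the way the check does
# (200 and 300 are the expected codes for Cinder, Cinderv2, and Cinderv3).
classify_code() {
  case "$1" in
    200|300) echo "PASS" ;;
    *)       echo "FAIL" ;;
  esac
}

check_endpoint() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  echo "$(classify_code "$code") $1 ($code)"
}

# Example (placeholder URL; take real ones from `openstack endpoint list`):
# check_endpoint http://ctl01:8776/v3
```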

Troubleshooting

Verify the availability of internal Cinder endpoints (URLs) from the output of openstack endpoint list.

Tuning

Not required

CinderApiDown

Removed since the 2019.2.11 maintenance update

Severity

Major

Summary

Cinder API is not accessible for the {{ $labels.name }} endpoint.

Raise condition

openstack_api_check_status{name=~"cinder.*"} == 0

Description

Raises when the check against one of the available internal Cinder endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.

Troubleshooting

Verify the availability of internal Cinder endpoints (URLs) from the output of the openstack endpoint list command.

Tuning

Not required

CinderApiEndpointDown

Severity

Minor

Summary

The cinder-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name=~"cinder-api"} == 0

Description

Raises when the check against a Cinder API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the host name of the affected node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.
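The URL that Telegraf checks can be pulled straight out of input-http_response.conf and probed with curl. A sketch, assuming the usual Telegraf TOML form with a urls = ["..."] line (the grep pattern is deliberately loose and matches any quoted string in the file):

```shell
#!/bin/sh
# Extract the checked URL(s) from the Telegraf http_response input and
# probe each one with curl.
CONF=/etc/telegraf/telegraf.d/input-http_response.conf

extract_urls() {
  # Pull quoted strings out of the configuration file; in a minimal
  # http_response input these are the entries of the urls = [...] list.
  grep -o '"[^"]*"' "$1" | tr -d '"'
}

# for url in $(extract_urls "$CONF"); do
#   curl -sv "$url" -o /dev/null
# done
```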

Tuning

Not required

CinderApiEndpointDownMajor

Severity

Major

Summary

More than 50% of cinder-api endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"cinder-api"} == 0) >= count(http_response_status{name=~"cinder-api"}) * 0.5
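The condition compares the number of failing endpoints with half the total, so with four cinder-api endpoints the alert fires as soon as two of them fail. The same >= 50% test as a quick arithmetic sketch:

```shell
#!/bin/sh
# Mirror of the raise condition: fire when down_count >= total_count * 0.5.
# Using down * 2 >= total keeps the comparison in integer arithmetic.
fires_major() {
  down=$1; total=$2
  [ $((down * 2)) -ge "$total" ]
}

# fires_major 2 4 && echo fires   # 2 of 4 endpoints down -> alert fires
# fires_major 1 4 || echo quiet   # 1 of 4 endpoints down -> no alert
```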

Description

Raises when the check against a Cinder API endpoint does not pass on more than 50% of OpenStack controller nodes. For details on the affected nodes, see the host label in the CinderApiEndpointDown alerts. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

CinderApiEndpointsOutage

Severity

Critical

Summary

All available cinder-api endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"cinder-api"} == 0) == count(http_response_status{name=~"cinder-api"})

Description

Raises when the check against a Cinder API endpoint does not pass on all OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

CinderServiceDown

Severity

Minor

Summary

The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.

Raise condition

openstack_cinder_service_state == 0

Description

Raises when a Cinder service on the OpenStack controller or compute node is in the DOWN state. For the list of Cinder services, see Cinder Block Storage service overview. The binary and hostname labels contain the name of the service that is in the DOWN state and the node that hosts the service.

Troubleshooting

  • Verify the list of Cinder services and their states using openstack volume service list.

  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.

  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.

  • Verify the Telegraf monitoring_remote_agent service:

    • Verify the status of the monitoring_remote_agent service using docker service ls.

    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
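The output of openstack volume service list can also be filtered for down services directly. A sketch of such a filter, assuming the usual tabular output with a State column containing up or down:

```shell
#!/bin/sh
# Filter service-listing lines whose State column reads "down".
# The function reads stdin, so it can be fed captured output, e.g.:
#   openstack volume service list | find_down_services
find_down_services() {
  grep -i '| *down *|'
}
```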

Tuning

Not required

CinderServicesDownMinor

Severity

Minor

Summary

More than 30% of {{ $labels.binary }} services are down.

Raise condition

count by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.3

Description

Raises when a Cinder service is in the DOWN state on more than 30% of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.

Troubleshooting

  • Verify the list of Cinder services and their states using openstack volume service list.

  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.

  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.

  • Verify the Telegraf monitoring_remote_agent service:

    • Verify the status of the monitoring_remote_agent service using docker service ls.

    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required

CinderServicesDownMajor

Severity

Major

Summary

More than 60% of {{ $labels.binary }} services are down.

Raise condition

count by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.6

Description

Raises when a Cinder service is in the DOWN state on more than 60% of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.

Troubleshooting

  • Verify the list of Cinder services and their states using openstack volume service list.

  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.

  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.

  • Verify the Telegraf monitoring_remote_agent service:

    • Verify the status of the monitoring_remote_agent service using docker service ls.

    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required

CinderServiceOutage

Severity

Critical

Summary

All {{ $labels.binary }} services are down.

Raise condition

count by(binary) (openstack_cinder_service_state == 0) == on(binary) count by(binary) (openstack_cinder_service_state)

Description

Raises when a Cinder service is in the DOWN state on all ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.

Troubleshooting

  • Verify the list of Cinder services and their states using openstack volume service list.

  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.

  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.

  • Verify the Telegraf monitoring_remote_agent service:

    • Verify the status of the monitoring_remote_agent service using docker service ls.

    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required

CinderVolumeProcessDown

Available starting from the 2019.2.8 maintenance update

Severity

Minor

Summary

A cinder-volume process is down.

Raise condition

procstat_running{process_name="cinder-volume"} == 0

Description

Raises when a cinder-volume process on a node is down. The host label in the raised alert contains the affected node.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.

  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
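The procstat check behind this alert boils down to counting running cinder-volume processes. A rough manual equivalent on the node might look like this (a sketch; pgrep from procps is assumed to be available):

```shell
#!/bin/sh
# Count processes whose command line matches a pattern, roughly what the
# Telegraf procstat input exposes as procstat_running.
running_count() {
  # pgrep -c prints 0 and exits non-zero when nothing matches;
  # ignore the exit status and keep the printed count.
  pgrep -cf "$1" || true
}

# On a healthy node this prints a non-zero count:
# running_count cinder-volume
```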

Tuning

Not required

CinderVolumeProcessesDownMinor

Available starting from the 2019.2.8 maintenance update

Severity

Minor

Summary

30% of cinder-volume processes are down.

Raise condition

count(procstat_running{process_name="cinder-volume"} == 0) >= count(procstat_running{process_name="cinder-volume"}) * {{ minor_threshold }}

Description

Raises when 30% or more of the cinder-volume processes are in the DOWN state. The alert includes the number of cinder-volume processes in the DOWN state.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.

  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.

Tuning

Not required

CinderVolumeProcessesDownMajor

Available starting from the 2019.2.8 maintenance update

Severity

Major

Summary

60% of cinder-volume processes are down.

Raise condition

count(procstat_running{process_name="cinder-volume"} == 0) >= count(procstat_running{process_name="cinder-volume"}) * {{ major_threshold }}

Description

Raises when 60% or more of the cinder-volume processes are in the DOWN state. The alert includes the number of cinder-volume processes in the DOWN state.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.

  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.

Tuning

Not required

CinderVolumeServiceOutage

Available starting from the 2019.2.8 maintenance update

Severity

Critical

Summary

The cinder-volume service is down.

Raise condition

count(procstat_running{process_name="cinder-volume"} == 0) == count(procstat_running{process_name="cinder-volume"})

Description

Raises when all cinder-volume processes are down.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.

  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.

Tuning

Not required

CinderErrorLogsTooHigh

Severity

Warning

Summary

The average rate of errors in Cinder logs on the {{ $labels.host }} node is higher than 0.2 messages per second.

Raise condition

sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))", service="cinder"}[5m])) > 0.2

Description

Raises when the average rate of error, fatal, or emergency messages in Cinder logs on the node exceeds 0.2 per second. The host label in the raised alert contains the affected node. Fluentd forwards all logs from Cinder to Elasticsearch and counts the number of log messages per severity.

Troubleshooting

  • Inspect the log files in the /var/log/cinder/ directory on the corresponding node.

  • Inspect Cinder logs in the Kibana web UI.

Tuning description

Typically, you should not change the default value. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.
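Besides the Prometheus Web UI, the raise condition can also be evaluated from the command line through the Prometheus HTTP API. A sketch; the Prometheus address is an assumption, substitute your own:

```shell
#!/bin/sh
# Evaluate the alert's raise condition against the Prometheus HTTP API.
# PROM_URL is a placeholder; point it at your Prometheus server.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

QUERY='sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))", service="cinder"}[5m]))'

# curl -G with --data-urlencode handles the special characters in the query:
# curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
```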

To change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CinderErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="cinder",
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.