Cinder

This section describes the alerts for Cinder.


CinderApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Cinder API is not accessible for all available Cinder endpoints in the OpenStack service catalog.
Raise condition max(openstack_api_check_status{name=~"cinder.*"}) == 0
Description Raises when the checks against all available internal Cinder endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Cinder endpoints (URLs) from the output of openstack endpoint list.
Tuning Not required
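
For example, to reproduce the endpoint check from the troubleshooting step above manually, list the internal Cinder endpoints and probe one of them with curl. The service type below is the usual one for cinderv3 and is an assumption; use the values returned for your own environment:

  # List the Cinder endpoints registered in the service catalog
  # (volumev3 is the typical service type for cinderv3; adjust if needed)
  openstack endpoint list --service volumev3 --interface internal

  # Probe an endpoint and print only the returned HTTP status code
  curl -s -o /dev/null -w "%{http_code}\n" <endpoint_URL>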

CinderApiDown

Removed since the 2019.2.11 maintenance update

Severity Major
Summary Cinder API is not accessible for the {{ $labels.name }} endpoint.
Raise condition openstack_api_check_status{name=~"cinder.*"} == 0
Description Raises when the check against one of the available internal Cinder endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes for Cinder, Cinderv2, and Cinderv3 are 200 and 300. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting Verify the availability of internal Cinder endpoints (URLs) from the output of the openstack endpoint list command.
Tuning Not required

CinderApiEndpointDown

Severity Minor
Summary The cinder-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"cinder-api"} == 0
Description Raises when the check against a Cinder API endpoint does not pass, typically meaning that the service endpoint is down or unreachable due to connectivity issues. The host label in the raised alert contains the host name of the affected node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl, as shown in the example below.
Tuning Not required
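
A minimal sketch of these checks on the affected node, assuming the default Telegraf layout described above (the actual URL and expected response code come from the configuration file, not from this example):

  # Show the URL and expected status code that Telegraf checks
  cat /etc/telegraf/telegraf.d/input-http_response.conf

  # Follow the Telegraf logs
  journalctl -u telegraf -f

  # Probe the configured URL and print only the returned HTTP status code
  curl -s -o /dev/null -w "%{http_code}\n" <configured_URL>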

CinderApiEndpointDownMajor

Severity Major
Summary More than 50% of cinder-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"cinder-api"} == 0) >= count(http_response_status{name=~"cinder-api"}) * 0.5
Description Raises when the check against a Cinder API endpoint does not pass on 50% or more of the OpenStack controller nodes. For details on the affected nodes, see the host label in the CinderApiEndpointDown alerts. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes, or query Prometheus directly as shown in the example below.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required
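
To see which cinder-api endpoints currently fail the check, you can evaluate the inner expression of the raise condition through the Prometheus HTTP API. A sketch assuming the default Prometheus port; substitute your Prometheus server address:

  # Returns one series per failing endpoint; the affected node is in the host label
  curl -G 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=http_response_status{name=~"cinder-api"} == 0'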

CinderApiEndpointsOutage

Severity Critical
Summary All available cinder-api endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"cinder-api"} == 0) == count(http_response_status{name=~"cinder-api"})
Description Raises when the check against a Cinder API endpoint does not pass on all OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting
  • Inspect the CinderApiEndpointDown alerts for the host names of the affected nodes.
  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.
  • Verify the configured URL availability using curl.
Tuning Not required

CinderServiceDown

Severity Minor
Summary The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.
Raise condition openstack_cinder_service_state == 0
Description Raises when a Cinder service on the OpenStack controller or compute node is in the DOWN state. For the list of Cinder services, see Cinder Block Storage service overview. The binary and hostname labels contain the name of the service that is in the DOWN state and the node that hosts the service.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary> (see the example below).
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
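
A hedged sketch of these checks, using cinder-scheduler as an example binary and its usual log file name (substitute the binary and hostname labels from the alert; the exact log file names in /var/log/cinder/ may differ):

  # List Cinder services and their reported states
  openstack volume service list

  # On the affected node, check the service reported in the binary label
  systemctl status cinder-scheduler
  less /var/log/cinder/cinder-scheduler.log

  # On one of the mon nodes, verify the Telegraf monitoring_remote_agent service
  docker service ls
  docker service logs monitoring_remote_agent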

CinderServicesDownMinor

Severity Minor
Summary More than 30% of {{ $labels.binary }} services are down.
Raise condition count by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.3
Description Raises when a Cinder service is in the DOWN state on 30% or more of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required

CinderServicesDownMajor

Severity Major
Summary More than 60% of {{ $labels.binary }} services are down.
Raise condition count by(binary) (openstack_cinder_service_state == 0) >= on(binary) count by(binary) (openstack_cinder_service_state) * 0.6
Description Raises when a Cinder service is in the DOWN state on 60% or more of the ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required

CinderServiceOutage

Severity Critical
Summary All {{ $labels.binary }} services are down.
Raise condition count by(binary) (openstack_cinder_service_state == 0) == on(binary) count by(binary) (openstack_cinder_service_state)
Description Raises when a Cinder service is in the DOWN state on all ctl or cmp hosts. For the list of services, see Cinder Block Storage service overview. Inspect the hostname label in the CinderServiceDown alerts for details on the affected services and nodes.
Troubleshooting
  • Verify the list of Cinder services and their states using openstack volume service list.
  • Verify the status of the corresponding Cinder service on the affected node using systemctl status <binary>.
  • Inspect the logs of the corresponding Cinder service on the affected node in the /var/log/cinder/ directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
Tuning Not required
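
To compare how many instances of each Cinder binary are down against the totals, the per-binary counts from the raise condition can be queried through the Prometheus HTTP API (the host name and port below are assumptions):

  # Number of instances of each binary currently reported as DOWN
  curl -G 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=count by(binary) (openstack_cinder_service_state == 0)'

  # Total number of instances of each binary
  curl -G 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=count by(binary) (openstack_cinder_service_state)'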

CinderVolumeProcessDown

Available starting from the 2019.2.8 maintenance update

Severity Minor
Summary A cinder-volume process is down.
Raise condition procstat_running{process_name="cinder-volume"} == 0
Description Raises when a cinder-volume process on a node is down. The host label in the raised alert contains the affected node.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
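
For example, on the node from the host label (the log file name is an assumption; the exact files in /var/log/cinder/ may differ):

  # Check the cinder-volume process status on the affected node
  systemctl status cinder-volume

  # Inspect recent cinder-volume log messages
  tail -n 100 /var/log/cinder/cinder-volume.log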

CinderVolumeProcessesDownMinor

Available starting from the 2019.2.8 maintenance update

Severity Minor
Summary 30% of cinder-volume processes are down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) >= count(procstat_running{process_name="cinder-volume"}) * {{ minor_threshold }}
Description Raises when at least one cinder-volume process is down (30% or more of the cinder-volume processes are in the DOWN state).
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required

CinderVolumeProcessesDownMajor

Available starting from the 2019.2.8 maintenance update

Severity Major
Summary 60% of cinder-volume processes are down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) >= count(procstat_running{process_name="cinder-volume"}) * {{ major_threshold }}
Description Raises when at least two cinder-volume processes are down (60% or more of the cinder-volume processes are in the DOWN state).
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required

CinderVolumeServiceOutage

Available starting from the 2019.2.8 maintenance update

Severity Critical
Summary The cinder-volume service is down.
Raise condition count(procstat_running{process_name="cinder-volume"} == 0) == count(procstat_running{process_name="cinder-volume"})
Description Raises when all cinder-volume processes are down.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status cinder-volume.
  • Inspect the cinder-volume log files in the /var/log/cinder/ directory.
Tuning Not required
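
Because this alert means that every cinder-volume process is down, it can be convenient to check all hosts at once from the Salt Master node. A sketch, assuming the cinder:controller pillar targets the nodes that run cinder-volume (adjust the target to your deployment):

  # Check the cinder-volume process status on all targeted nodes
  salt -C 'I@cinder:controller' cmd.run 'systemctl status cinder-volume'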

CinderErrorLogsTooHigh

Severity Warning
Summary The average rate of errors in Cinder logs on the {{ $labels.host }} node exceeds 0.2 error messages per second.
Raise condition sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))", service="cinder"}[5m])) > 0.2
Description Raises when the average per-second rate of error, fatal, or emergency messages in Cinder logs on the node is more than 0.2 per second. The host label in the raised alert contains the affected node. Fluentd forwards all logs from Cinder to Elasticsearch and counts the number of log messages per severity.
Troubleshooting
  • Inspect the log files in the /var/log/cinder/ directory on the corresponding node.
  • Inspect Cinder logs in the Kibana web UI.
Tuning

Typically, you should not change the default value. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus Web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.
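
For example, to inspect the current error rate before choosing a threshold, evaluate the raise condition expression without the comparison over a longer time range. A sketch using the Prometheus HTTP API; the host name, port, and time range placeholders are assumptions:

  # Per-node rate of error, fatal, and emergency messages in Cinder logs
  curl -G 'http://<prometheus_host>:9090/api/v1/query_range' \
    --data-urlencode 'query=sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))", service="cinder"}[5m]))' \
    --data-urlencode 'start=<start_time>' \
    --data-urlencode 'end=<end_time>' \
    --data-urlencode 'step=1h'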

To change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file has already been defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CinderErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="cinder",
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.