Nova service

This section describes the Nova API and services alerts.


NovaApiOutage

Removed since the 2019.2.11 maintenance update

Severity

Critical

Summary

Nova API is not accessible for all available Nova endpoints in the OpenStack service catalog.

Raise condition

max(openstack_api_check_status{name=~"nova.*|placement"}) == 0

Description

Raises when the checks against all available internal Nova or placement endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 for nova and nova20, and 200 or 401 for placement. For a list of all available endpoints, run openstack endpoint list.
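The check logic described above can be sketched in Python. This is a minimal illustration and not the Telegraf implementation: the EXPECTED_CODES table restates the codes from the description, while the function names and the sample responses are hypothetical.

```python
# Expected HTTP response codes per endpoint name, as listed in the description.
EXPECTED_CODES = {
    "nova": {200},
    "nova20": {200},
    "placement": {200, 401},
}


def api_check_status(name, actual_code):
    """Return 1 if the response code is expected for the endpoint, else 0."""
    return 1 if actual_code in EXPECTED_CODES.get(name, set()) else 0


def nova_api_outage(responses):
    """Mimic max(openstack_api_check_status{...}) == 0 over all endpoints."""
    statuses = [api_check_status(name, code) for name, code in responses.items()]
    return bool(statuses) and max(statuses) == 0


# Hypothetical responses: placement answers 401, which still counts as a pass.
print(nova_api_outage({"nova": 503, "nova20": 503, "placement": 401}))  # False
print(nova_api_outage({"nova": 503, "nova20": 503, "placement": 500}))  # True
```

The alert fires only when every matching endpoint fails its check, because max() over the statuses is 0 only if no endpoint reports 1.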

Troubleshooting

  • Verify the states of Nova endpoints from the output of openstack endpoint list.

  • Inspect the NovaApiDown alert for the nodes and services that are in the DOWN state.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaApiDown

Removed since the 2019.2.11 maintenance update

Severity

Major

Summary

Nova API is not accessible for the {{ $labels.name }} endpoint.

Raise condition

openstack_api_check_status{name=~"nova.*|placement"} == 0

Description

Raises when the check against one of the available internal Nova or placement endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are 200 for nova and nova20, and 200 or 401 for placement. For a list of all available endpoints, run openstack endpoint list.

Troubleshooting

  • Verify the states of Nova endpoints from the output of openstack endpoint list.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaApiEndpointDown

Severity

Minor

Summary

The nova-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name=~"nova-api"} == 0

Description

Raises when the check against a Nova API endpoint does not pass on a node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.

  • Verify the configured URL availability using curl.
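As an alternative to curl, the availability check can be scripted. The sketch below rests on assumptions: the URL is a placeholder, the expected code 200 is a default chosen for illustration, and the function names are hypothetical. The real URL and expected code are the ones configured in /etc/telegraf/telegraf.d/input-http_response.conf.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 401 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar failures.
        return None


def endpoint_ok(url, expected_code=200, fetch=fetch_status):
    """Return True if the endpoint answers with the expected code."""
    return fetch(url) == expected_code


# Offline demonstration with a stand-in fetch function (no network needed).
print(endpoint_ok("http://nova-api.example:8774/", fetch=lambda url: 200))   # True
print(endpoint_ok("http://nova-api.example:8774/", fetch=lambda url: None))  # False
```

To probe a real endpoint, call endpoint_ok with the configured URL and omit the fetch argument.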

Tuning

Not required

NovaApiEndpointsDownMajor

Severity

Major

Summary

{{ $value }} nova-api endpoints (>= 0.5 * 100%) are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"nova-api"} == 0) >= count(http_response_status{name=~"nova-api"}) * 0.5

Description

Raises when the check against a Nova API endpoint does not pass on at least 50% of OpenStack controller nodes. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected 200 response code with the actual HTTP response code. For details, see HTTP response input plugin.
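The arithmetic behind the raise condition can be checked with a short Python sketch. The function name and the sample status lists are hypothetical. Each list holds one value per controller node, where 0 means that the check failed.

```python
def endpoints_down_alert(statuses, threshold=0.5):
    """Mimic count(status == 0) >= count(status) * threshold."""
    down = sum(1 for status in statuses if status == 0)
    # PromQL returns no series when nothing matches "== 0", so no alert fires.
    return down > 0 and down >= len(statuses) * threshold


print(endpoints_down_alert([1, 0, 0]))  # True: 2 >= 3 * 0.5
print(endpoints_down_alert([1, 1, 0]))  # False: 1 < 3 * 0.5
print(endpoints_down_alert([1, 0]))     # True: 1 >= 2 * 0.5
```

Because the comparison is >=, the alert also fires when exactly half of the endpoints are down, as in the last example.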

Troubleshooting

  • Inspect the NovaApiEndpointDown alert for the nodes and services that are in the DOWN state.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.

  • Verify the configured URL availability using curl.

Tuning

Not required

NovaApiEndpointsOutage

Severity

Critical

Summary

All available nova-api endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"nova-api"} == 0) == count(http_response_status{name=~"nova-api"})

Description

Raises when the check against a Nova API endpoint fails on every OpenStack controller node. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.

Troubleshooting

  • Inspect the NovaApiEndpointDown alert for the nodes and services that are in the DOWN state.

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf/.

  • Verify the configured URL availability using curl.

Tuning

Not required

NovaServiceDown

Severity

Minor

Summary

The {{ $labels.binary }} service on the {{ $labels.hostname }} node is down.

Raise condition

openstack_nova_service_state == 0

Description

Raises when the Nova controller or compute service (os-service) is in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview. The binary and hostname labels in the raised alert contain the service name that is in the DOWN state and the affected node name.

Troubleshooting

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
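When the service list is long, the DOWN entries can be filtered programmatically. The sketch below assumes JSON output from openstack compute service list -f json and assumes that the column names are Binary, Host, and State. These names may differ between client versions. The sample data and the function name are made up for illustration.

```python
import json


def down_services(service_list_json):
    """Return (binary, host) pairs for services whose State is down."""
    services = json.loads(service_list_json)
    return [
        (service["Binary"], service["Host"])
        for service in services
        if str(service.get("State", "")).lower() == "down"
    ]


# Made-up sample that imitates the assumed JSON layout.
SAMPLE = """[
  {"ID": 1, "Binary": "nova-scheduler", "Host": "ctl01", "Status": "enabled", "State": "up"},
  {"ID": 2, "Binary": "nova-compute", "Host": "cmp001", "Status": "enabled", "State": "down"}
]"""

print(down_services(SAMPLE))  # [('nova-compute', 'cmp001')]
```

The returned pairs correspond to the binary and hostname labels of the raised alert.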

Tuning

Not required

NovaServicesDownMinor

Severity

Minor

Summary

{{ $value }} {{ $labels.binary }} services (>= 0.3 * 100%) are down.

Raise condition

count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.3

Description

Raises when at least 30% of Nova controller services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.
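The per-type grouping in the raise condition (by (binary) and on (binary)) can be illustrated in Python. This is a sketch with hypothetical names and sample data. Each tuple is a service binary and its state, where 0 means DOWN.

```python
from collections import Counter


def services_down_by_binary(services, threshold=0.3, exclude="nova-compute"):
    """Return {binary: down_count} for each type at or above the threshold."""
    total = Counter()
    down = Counter()
    for binary, state in services:
        if binary == exclude:
            continue  # compute services are handled by separate alerts
        total[binary] += 1
        if state == 0:
            down[binary] += 1
    # Compare the DOWN count with the total count of the same binary.
    return {b: n for b, n in down.items() if n >= total[b] * threshold}


SAMPLE = [
    ("nova-scheduler", 0), ("nova-scheduler", 1), ("nova-scheduler", 1),
    ("nova-conductor", 1), ("nova-conductor", 1), ("nova-conductor", 1),
    ("nova-compute", 0),
]
print(services_down_by_binary(SAMPLE))  # {'nova-scheduler': 1}
```

In the sample, one of three nova-scheduler services is down (1 >= 3 * 0.3), so only that type is reported. The returned count plays the role of {{ $value }} in the summary.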

Troubleshooting

  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaComputeServicesDownMinor

Severity

Minor

Summary

{{ $value }} nova-compute services (>= 0.25 * 100%) are down.

Raise condition

count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.25

Description

Raises when at least 25% of Nova compute services are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.

Troubleshooting

  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaServicesDownMajor

Severity

Major

Summary

{{ $value }} {{ $labels.binary }} services (>= 0.6 * 100%) are down.

Raise condition

count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.6

Description

Raises when at least 60% of Nova controller services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.

Troubleshooting

  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaComputeServicesDownMajor

Severity

Major

Summary

{{ $value }} nova-compute services (>= 0.5 * 100%) are down.

Raise condition

count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.5

Description

Raises when at least 50% of Nova compute services are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview.

Troubleshooting

  • Inspect the NovaServiceDown alert for the nodes and services that are in the DOWN state.

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaServiceOutage

Severity

Critical

Summary

All {{ $labels.binary }} services are down.

Raise condition

count(openstack_nova_service_state == 0) by (binary) == on (binary) count(openstack_nova_service_state) by (binary)

Description

Raises when all Nova controller or compute services of the same type are in the DOWN state, according to the data from Nova API. For details, see Compute services and Compute service overview. The binary label in the raised alert contains the name of the service type that is in the DOWN state.

Troubleshooting

  • Verify the states of Nova services from the output of the openstack compute service list command.

  • Verify the status of the monitoring_remote_agent service by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.

Tuning

Not required

NovaErrorLogsTooHigh

Severity

Warning

Summary

The average per-second rate of errors in the Nova logs on the {{ $labels.host }} node is more than 0.2 messages per second (as measured over the last 5 minutes).

Raise condition

sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2

Description

Raises when the average per-second rate of the error, fatal, or emergency messages in Nova logs on the node is more than 0.2 per second. Fluentd forwards all logs from Nova to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the name of the affected node.
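To get a feel for the threshold, note that 0.2 messages per second over a 5-minute window corresponds to 0.2 * 300 = 60 messages. The Python sketch below works through this arithmetic under simplifying assumptions: it divides the counter increase by the window length and ignores the extrapolation that the Prometheus rate() function applies. The function name and the sample counts are hypothetical.

```python
import re

# Same case-insensitive pattern as in the raise condition.
LEVEL_RE = re.compile(r"(?i:(error|emergency|fatal))")
WINDOW_SECONDS = 5 * 60


def error_rate(increase_by_level, window=WINDOW_SECONDS):
    """Sum the per-second rates of all levels that match the pattern."""
    # Prometheus anchors regex matchers, hence fullmatch instead of search.
    matching = sum(
        count for level, count in increase_by_level.items()
        if LEVEL_RE.fullmatch(level)
    )
    return matching / window


# Hypothetical number of new log messages per level in the last 5 minutes.
rate = error_rate({"ERROR": 45, "fatal": 30, "warning": 500})
print(rate, rate > 0.2)  # 0.25 True
```

Here 75 matching messages in 300 seconds give 0.25 messages per second, which exceeds the default threshold, while the warning messages are not counted.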

Troubleshooting

Inspect the log files in the /var/log/nova/ directory of the affected node.

Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the Nova error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the rate of a particular message type in the logs over a longer period of time and define the best threshold.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="nova",
                level=~"(?i:(error|emergency|fatal))"}[5m]))
                without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
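Besides the web UI, the updated definition can be checked through the Prometheus HTTP API, which exposes the loaded rules at /api/v1/rules. The sketch below only parses such a response. The function name and the sample payload are hypothetical, and the address of the Prometheus server in a particular deployment is not covered here.

```python
def alert_expression(rules_payload, alert_name):
    """Return the query of the named alerting rule, or None if it is absent."""
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule.get("query")
    return None


# Hypothetical, shortened payload in the shape returned by /api/v1/rules.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "alert.rules",
                "rules": [
                    {
                        "type": "alerting",
                        "name": "NovaErrorLogsTooHigh",
                        "query": "sum(rate(log_messages[5m])) without (level) > 0.4",
                    }
                ],
            }
        ]
    },
}

print(alert_expression(SAMPLE, "NovaErrorLogsTooHigh"))
```

If the printed query still contains the old threshold, the Salt state has not been applied or Prometheus has not reloaded its configuration yet.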

NovaComputeSystemLoadTooHighWarning

Available starting from the 2019.2.9 maintenance update

Severity

Warning

Summary

The system load per CPU on the {{ $labels.host }} node is more than 1 for 5 minutes.

Raise condition

system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.0

Description

Raises when the load average on an OpenStack compute node, as reported by the system_load15 metric, is higher than 1 per CPU core for 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.
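The ratio that the alert evaluates can also be computed directly on the node. The Python sketch below divides the 15-minute load average by the number of CPUs and compares the result with the default thresholds of the warning (1.0) and critical (2.0) alerts. The function names are hypothetical, and os.getloadavg is available on Unix-like systems only.

```python
import os


def load_per_cpu(load15, n_cpus):
    """Return the 15-minute load average divided by the CPU count."""
    return load15 / n_cpus


def classify(load15, n_cpus, warning=1.0, critical=2.0):
    """Map the load per CPU to the severity of the matching alert."""
    ratio = load_per_cpu(load15, n_cpus)
    if ratio > critical:
        return "critical"
    if ratio > warning:
        return "warning"
    return "ok"


# Read the local values when the platform provides them.
try:
    _, _, local_load15 = os.getloadavg()
    print(classify(local_load15, os.cpu_count() or 1))
except (AttributeError, OSError):
    print("load average is not available on this platform")
```

For example, a load average of 12 on an 8-core node gives 1.5 per core, which is above the warning threshold but below the critical one.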

Tuning

For example, to change the threshold to 1.5 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaComputeSystemLoadTooHighWarning:
              if: >-
                system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NovaComputeSystemLoadTooHighCritical

Available starting from the 2019.2.9 maintenance update

Severity

Critical

Summary

The system load per CPU on the {{ $labels.host }} node is more than 2 for 5 minutes.

Raise condition

system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 2.0

Description

Raises when the load average on an OpenStack compute node, as reported by the system_load15 metric, is higher than 2 per CPU core for 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.

Tuning

For example, to change the threshold to 3 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NovaComputeSystemLoadTooHighCritical:
              if: >-
                system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 3
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.