Nova service¶

This section describes the Nova API and services alerts.

NovaApiOutage
NovaApiDown
NovaApiEndpointDown
NovaApiEndpointsDownMajor
NovaApiEndpointsOutage
NovaServiceDown
NovaServicesDownMinor
NovaComputeServicesDownMinor
NovaServicesDownMajor
NovaComputeServicesDownMajor
NovaServiceOutage
NovaErrorLogsTooHigh
NovaComputeSystemLoadTooHighWarning
NovaComputeSystemLoadTooHighCritical

NovaApiOutage¶

Removed since the 2019.2.11 maintenance update

Severity	Critical
Summary	Nova API is not accessible for all available Nova endpoints in the OpenStack service catalog.
Raise condition	`max(openstack_api_check_status{name=~"nova.*|placement"}) == 0`
Description	Raises when the checks against all available internal Nova or placement endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are `200` for `nova` and `nova20`, `200` and `401` for `placement`. For a list of all available endpoints, run `openstack endpoint list`.
Troubleshooting	Verify the states of Nova endpoints from the output of `openstack endpoint list`. Inspect the `NovaApiDown` alert for the nodes and services that are in the `DOWN` state. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required
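
The comparison of expected and actual response codes described above can be sketched in a few lines of Python. This is only an illustration, not the Telegraf implementation: the function names `check_status` and `api_outage` and the sample response codes are hypothetical, and only the expected codes per endpoint come from the description.

```python
# Expected HTTP codes per endpoint name, as listed in the alert description.
EXPECTED_CODES = {
    "nova": {200},
    "nova20": {200},
    "placement": {200, 401},
}


def check_status(name, actual_code):
    """Return 1 if the response code is expected for the endpoint, else 0."""
    return 1 if actual_code in EXPECTED_CODES.get(name, set()) else 0


def api_outage(statuses):
    """Mirror max(status) == 0: the alert fires only if every check failed."""
    return bool(statuses) and max(statuses.values()) == 0


# Hypothetical sample responses, not real measurements.
statuses = {
    "nova": check_status("nova", 503),
    "nova20": check_status("nova20", 503),
    "placement": check_status("placement", 500),
}
print(api_outage(statuses))
```

Because the condition takes the maximum over all endpoints, a single passing check is enough to keep this alert silent.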

NovaApiDown¶

Removed since the 2019.2.11 maintenance update

Severity	Major
Summary	Nova API is not accessible for the `{{ $labels.name }}` endpoint.
Raise condition	`openstack_api_check_status{name=~"nova.*|placement"} == 0`
Description	Raises when the check against one of the available internal Nova or placement endpoints in the OpenStack service catalog does not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response codes are `200` for `nova` and `nova20`, `200` and `401` for `placement`. For a list of all available endpoints, run `openstack endpoint list`.
Troubleshooting	Verify the states of Nova endpoints from the output of `openstack endpoint list`. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaApiEndpointDown¶

Severity	Minor
Summary	The `nova-api` endpoint on the `{{ $labels.host }}` node is not accessible for 2 minutes.
Raise condition	`http_response_status{name=~"nova-api"} == 0`
Description	Raises when the check against a Nova API endpoint does not pass on a node. Telegraf sends a request to the URL configured in `/etc/telegraf/telegraf.d/input-http_response.conf` on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting	Inspect the Telegraf logs using `journalctl -u telegraf` or in `/var/log/telegraf/`. Verify the configured URL availability using `curl`.
Tuning	Not required

NovaApiEndpointsDownMajor¶

Severity	Major
Summary	`{{ $value }}` `nova-api` endpoints (>= 0.5 * 100%) are not accessible for 2 minutes.
Raise condition	`count(http_response_status{name=~"nova-api"} == 0) >= count(http_response_status{name=~"nova-api"}) * 0.5`
Description	Raises when the check against a Nova API endpoint does not pass on at least 50% of OpenStack controller nodes. Telegraf sends a request to the URL configured in `/etc/telegraf/telegraf.d/input-http_response.conf` on the corresponding node and compares the actual HTTP response code with the expected `200` code. For details, see HTTP response input plugin.
Troubleshooting	Inspect the `NovaApiEndpointDown` alert for the nodes and services that are in the `DOWN` state. Inspect the Telegraf logs using `journalctl -u telegraf` or in `/var/log/telegraf/`. Verify the configured URL availability using `curl`.
Tuning	Not required
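
To see how the `>= ... * 0.5` comparison in the raise condition behaves for small clusters, the following Python sketch evaluates the same arithmetic on a list of per-node check results (`1` for a passing check, `0` for a failing one). The function name `down_ratio_reached` and the sample lists are hypothetical.

```python
def down_ratio_reached(statuses, ratio):
    """Mirror count(x == 0) >= count(x) * ratio for a list of 0/1 values."""
    down = sum(1 for s in statuses if s == 0)
    # PromQL returns no series when nothing matches `== 0`, so no alert fires.
    return down > 0 and down >= len(statuses) * ratio


# Hypothetical check results for three controller nodes.
print(down_ratio_reached([0, 1, 1], 0.5))  # 1 of 3 down: 1 >= 1.5 is false
print(down_ratio_reached([0, 0, 1], 0.5))  # 2 of 3 down: 2 >= 1.5 is true
# With an even number of nodes, exactly half being down already matches.
print(down_ratio_reached([0, 1], 0.5))     # 1 of 2 down: 1 >= 1.0 is true
```

The comparison is inclusive, so the alert also fires when exactly half of the endpoints are down.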

NovaApiEndpointsOutage¶

Severity	Critical
Summary	All available `nova-api` endpoints are not accessible for 2 minutes.
Raise condition	`count(http_response_status{name=~"nova-api"} == 0) == count(http_response_status{name=~"nova-api"})`
Description	Raises when the check against a Nova API endpoint does not pass on all OpenStack controller nodes. Telegraf sends a request to the URL configured in `/etc/telegraf/telegraf.d/input-http_response.conf` on the corresponding node and compares the expected and actual HTTP response codes from the configuration file.
Troubleshooting	Inspect the `NovaApiEndpointDown` alert for the nodes and services that are in the `DOWN` state. Inspect the Telegraf logs using `journalctl -u telegraf` or in `/var/log/telegraf/`. Verify the configured URL availability using `curl`.
Tuning	Not required

NovaServiceDown¶

Severity	Minor
Summary	The `{{ $labels.binary }}` service on the `{{ $labels.hostname }}` node is down.
Raise condition	`openstack_nova_service_state == 0`
Description	Raises when the Nova controller or compute service (`os-service`) is in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview. The `binary` and `hostname` labels in the raised alert contain the service name that is in the `DOWN` state and the affected node name.
Troubleshooting	Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaServicesDownMinor¶

Severity	Minor
Summary	`{{ $value }}` `{{ $labels.binary }}` services (>= 0.3 * 100%) are down.
Raise condition	`count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.3`
Description	Raises when at least 30% of Nova controller services of the same type are in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting	Inspect the `NovaServiceDown` alert for the nodes and services that are in the `DOWN` state. Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required
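
The per-type grouping in the raise condition (`by (binary)` combined with `on (binary)`) can be illustrated with a short Python sketch. It assumes a list of `(binary, state)` pairs with `1` for `UP` and `0` for `DOWN`; the function name `services_down_by_binary` and the sample data are hypothetical.

```python
from collections import defaultdict


def services_down_by_binary(services, ratio, exclude="nova-compute"):
    """Return {binary: down_count} for each type that reaches the ratio."""
    total = defaultdict(int)
    down = defaultdict(int)
    for binary, state in services:
        # Prometheus anchors regex matchers, so binary!~"nova-compute"
        # excludes only the exact value "nova-compute".
        if binary == exclude:
            continue
        total[binary] += 1
        if state == 0:
            down[binary] += 1
    return {b: n for b, n in down.items() if n >= total[b] * ratio}


# Hypothetical service states, not real measurements.
services = [
    ("nova-scheduler", 0), ("nova-scheduler", 1), ("nova-scheduler", 1),
    ("nova-conductor", 1), ("nova-conductor", 1), ("nova-conductor", 1),
    ("nova-compute", 0),
]
print(services_down_by_binary(services, 0.3))  # {'nova-scheduler': 1}
print(services_down_by_binary(services, 0.6))  # {}
```

Each service type is compared against its own total, so one failed `nova-scheduler` out of three reaches the 0.3 ratio even though all `nova-conductor` services are up.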

NovaComputeServicesDownMinor¶

Severity	Minor
Summary	`{{ $value }}` `nova-compute` services (>= 0.25 * 100%) are down.
Raise condition	`count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.25`
Description	Raises when at least 25% of Nova compute services are in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting	Inspect the `NovaServiceDown` alert for the nodes and services that are in the `DOWN` state. Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaServicesDownMajor¶

Severity	Major
Summary	`{{ $value }}` `{{ $labels.binary }}` services (>= 0.6 * 100%) are down.
Raise condition	`count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by (binary) >= on (binary) count(openstack_nova_service_state{binary!~"nova-compute"}) by (binary) * 0.6`
Description	Raises when at least 60% of Nova controller services of the same type are in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting	Inspect the `NovaServiceDown` alert for the nodes and services that are in the `DOWN` state. Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaComputeServicesDownMajor¶

Severity	Major
Summary	`{{ $value }}` `nova-compute` services (>= 0.5 * 100%) are down.
Raise condition	`count(openstack_nova_service_state{binary="nova-compute"} == 0) >= count(openstack_nova_service_state{binary="nova-compute"}) * 0.5`
Description	Raises when at least 50% of Nova compute services are in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview.
Troubleshooting	Inspect the `NovaServiceDown` alert for the nodes and services that are in the `DOWN` state. Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaServiceOutage¶

Severity	Critical
Summary	All `{{ $labels.binary }}` services are down.
Raise condition	`count(openstack_nova_service_state == 0) by (binary) == on (binary) count(openstack_nova_service_state) by (binary)`
Description	Raises when all Nova controller or compute services of the same type are in the `DOWN` state, according to the data from Nova API. For details, see Compute services and Compute service overview. The `binary` and `hostname` labels in the raised alert contain the service name that is in the `DOWN` state and the affected node name.
Troubleshooting	Verify the states of Nova services from the output of the `openstack compute service list` command. Verify the status of the `monitoring_remote_agent` service by running `docker service ls` on a `mon` node. Inspect the `monitoring_remote_agent` service logs by running `docker service logs monitoring_remote_agent` on a `mon` node.
Tuning	Not required

NovaErrorLogsTooHigh¶

Severity	Warning
Summary	The average per-second rate of errors in the Nova logs on the `{{ $labels.host }}` node is more than 0.2 messages per second (as measured over the last 5 minutes).
Raise condition	`sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2`
Description	Raises when the average per-second rate of the `error`, `fatal`, or `emergency` messages in Nova logs on the node is more than 0.2 per second. Fluentd forwards all logs from Nova to Elasticsearch and counts the number of log messages per severity. The `host` label in the raised alert contains the name of the affected node.
Troubleshooting	Inspect the log files in the `/var/log/nova/` directory of the affected node.
Tuning	Typically, you should not change the default value. If the alert is constantly firing, inspect the Nova error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to `0.4`:

1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

   a. Create a file for alert customizations:

          touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:

          classes:
          - cluster.<cluster_name>.stacklight.custom.alerts
          ...

2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:

       parameters:
         prometheus:
           server:
             alert:
               NovaErrorLogsTooHigh:
                 if: >-
                   sum(rate(log_messages{service="nova", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4

3. From the Salt Master node, apply the changes:

       salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
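
When choosing a threshold, it can help to see how the rate in the raise condition is derived. The Python sketch below is a simplification under stated assumptions: it takes hypothetical per-level counter increases over a 5-minute window, keeps the levels matched by the case-insensitive pattern from the raise condition, and divides by the window length. The function name `error_rate` and the sample counts are not part of StackLight.

```python
import re

# Prometheus anchors regex matchers, so fullmatch() is used below.
LEVEL_RE = re.compile(r"(?i:(error|emergency|fatal))")


def error_rate(increases, window_seconds=300):
    """Sum matching per-level increases and convert to a per-second rate."""
    matched = sum(
        count for level, count in increases.items()
        if LEVEL_RE.fullmatch(level)
    )
    return matched / window_seconds


# Hypothetical message counts over the last 5 minutes on one node.
increases = {"ERROR": 75, "fatal": 15, "info": 4000, "warning": 120}
rate = error_rate(increases)
print(rate, rate > 0.2, rate > 0.4)
```

In this sample, 90 matching messages in 300 seconds exceed the default `0.2` threshold but stay below the `0.4` value used in the tuning example.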

NovaComputeSystemLoadTooHighWarning¶

Available starting from the 2019.2.9 maintenance update

Severity	Warning
Summary	The system load per CPU on the `{{ $labels.host }}` node is more than `1` for 5 minutes.
Raise condition	`system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.0`
Description	Raises when the 15-minute load average (`system_load15`) on an OpenStack compute node stays higher than `1` per CPU core for 5 minutes, which indicates that the system is overloaded and many processes are waiting for CPU time. The `host` label in the raised alert contains the name of the affected node.
Troubleshooting	Inspect the output of the `uptime` and `top` commands on the affected node.
Tuning	For example, to change the threshold to `1.5` per core:

1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

   a. Create a file for alert customizations:

          touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:

          classes:
          - cluster.<cluster_name>.stacklight.custom.alerts
          ...

2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:

       parameters:
         prometheus:
           server:
             alert:
               NovaComputeSystemLoadTooHighWarning:
                 if: >-
                   system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.5

3. From the Salt Master node, apply the changes:

       salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
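
The effect of a different threshold can be estimated offline with a small Python sketch of the same arithmetic. It assumes a mapping of host name to a `(load15, n_cpus)` pair; the function name `overloaded_hosts` and the sample values are hypothetical.

```python
import re

# Same host pattern as in the raise condition; Prometheus anchors it.
HOST_RE = re.compile(r".*cmp[0-9]+")


def overloaded_hosts(samples, threshold=1.0):
    """Return {host: load_per_cpu} for compute hosts above the threshold."""
    result = {}
    for host, (load15, n_cpus) in samples.items():
        if not HOST_RE.fullmatch(host) or n_cpus <= 0:
            continue
        per_cpu = load15 / n_cpus
        if per_cpu > threshold:
            result[host] = per_cpu
    return result


# Hypothetical samples: host -> (15-minute load average, CPU count).
samples = {
    "cmp001": (40.0, 32),
    "cmp002": (12.0, 32),
    "ctl01": (50.0, 16),
}
print(overloaded_hosts(samples))        # {'cmp001': 1.25}
print(overloaded_hosts(samples, 1.5))   # {}
```

The `ctl01` host is ignored because its name does not match the compute node pattern, regardless of its load.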

NovaComputeSystemLoadTooHighCritical¶