Nova service
This section describes the alerts for the Nova API and services.
NovaApiOutage
Removed since the 2019.2.11 maintenance update
Severity |
Critical |
Summary |
Nova API is not accessible for all available Nova endpoints in the
OpenStack service catalog. |
Raise condition |
max(openstack_api_check_status{name=~"nova.*|placement"}) == 0 |
Description |
Raises when the checks against all available internal Nova or placement
endpoints in the OpenStack service catalog do not pass. Telegraf sends
HTTP requests to the URLs from the OpenStack service catalog and
compares the expected and actual HTTP response codes. The expected
response codes are 200 for nova and nova20, and 200 and 401 for
placement. For a list of all available endpoints, run
openstack endpoint list. |
Troubleshooting |
- Verify the states of Nova endpoints from the output of
openstack endpoint list.
- Inspect the NovaApiDown alert for the nodes and services that are in
the DOWN state.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node
(see the example at the end of this alert).
|
Tuning |
Not required |
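As a quick reference, the troubleshooting steps above can be combined into the
following shell sketch. The --service and --filter arguments are optional
conveniences, and the openstack commands require admin credentials:
# Verify the Nova and placement endpoints registered in the service catalog
openstack endpoint list --service nova
openstack endpoint list --service placement
# On a mon node, verify that the monitoring_remote_agent service is running
docker service ls --filter name=monitoring_remote_agent
# Inspect the recent logs of the monitoring_remote_agent service
docker service logs --tail 100 monitoring_remote_agent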
NovaApiDown
Removed since the 2019.2.11 maintenance update
Severity |
Major |
Summary |
Nova API is not accessible for the {{ $labels.name }} endpoint. |
Raise condition |
openstack_api_check_status{name=~"nova.*|placement"} == 0 |
Description |
Raises when a check against one of the available internal Nova or
placement endpoints in the OpenStack service catalog does not pass.
Telegraf sends HTTP requests to the URLs from the OpenStack service
catalog and compares the expected and actual HTTP response codes. The
expected response codes are 200 for nova and nova20, and 200 and 401
for placement. For a list of all available endpoints, run
openstack endpoint list. |
Troubleshooting |
- Verify the states of Nova endpoints from the output of
openstack endpoint list.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaApiEndpointDown
Severity |
Minor |
Summary |
The nova-api endpoint on the {{ $labels.host }} node is not
accessible for 2 minutes. |
Raise condition |
http_response_status{name=~"nova-api"} == 0 |
Description |
Raises when the check against a Nova API endpoint does not pass on a
node. Telegraf sends a request to the URL configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. |
Troubleshooting |
- Inspect the Telegraf logs using journalctl -u telegraf or in
/var/log/telegraf/.
- Verify the configured URL availability using curl (see the example
at the end of this alert).
|
Tuning |
Not required |
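For reference, a minimal verification sequence on the affected node may look as
follows; the grep pattern and the URL are placeholders that depend on the actual
content of the Telegraf configuration file:
# Show the URL and expected status code that Telegraf checks on this node
grep -A 3 nova-api /etc/telegraf/telegraf.d/input-http_response.conf
# Query the same URL manually and print only the returned HTTP status code
curl -o /dev/null -s -w '%{http_code}\n' <nova-api_URL_from_the_configuration_file>
# Review the recent Telegraf log messages
journalctl -u telegraf --since '15 min ago'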
NovaApiEndpointsDownMajor
Severity |
Major |
Summary |
{{ $value }} nova-api endpoints (>= 0.5 * 100%) are not
accessible for 2 minutes. |
Raise condition |
count(http_response_status{name=~"nova-api"} == 0) >=
count(http_response_status{name=~"nova-api"}) * 0.5 |
Description |
Raises when the check against a Nova API endpoint does not pass on more
than 50% of OpenStack controller nodes. Telegraf sends a request to the
URL configured in /etc/telegraf/telegraf.d/input-http_response.conf
on the corresponding node and compares the expected 200 response
code and actual HTTP response codes. For details, see
HTTP response input plugin. |
Troubleshooting |
- Inspect the NovaApiEndpointDown alert for the nodes and services
that are in the DOWN state.
- Inspect the Telegraf logs using journalctl -u telegraf or in
/var/log/telegraf/.
- Verify the configured URL availability using curl.
|
Tuning |
Not required |
NovaApiEndpointsOutage
Severity |
Critical |
Summary |
All available nova-api endpoints are not accessible for 2 minutes. |
Raise condition |
count(http_response_status{name=~"nova-api"} == 0) ==
count(http_response_status{name=~"nova-api"}) |
Description |
Raises when the check against a Nova API endpoint does not pass on all
OpenStack controller nodes. Telegraf sends a request to the URL
configured in /etc/telegraf/telegraf.d/input-http_response.conf on
the corresponding node and compares the expected and actual HTTP
response codes from the configuration file. |
Troubleshooting |
- Inspect the NovaApiEndpointDown alert for the nodes and services
that are in the DOWN state.
- Inspect the Telegraf logs using journalctl -u telegraf or in
/var/log/telegraf/.
- Verify the configured URL availability using curl.
|
Tuning |
Not required |
NovaServiceDown
Severity |
Minor |
Summary |
The {{ $labels.binary }} service on the {{ $labels.hostname }}
node is down. |
Raise condition |
openstack_nova_service_state == 0 |
Description |
Raises when the Nova controller or compute service (os-service) is
in the DOWN state, according to the data from Nova API. For details,
see Compute services
and Compute service overview.
The binary and hostname labels in the raised alert contain the
service name that is in the DOWN state and the affected node name. |
Troubleshooting |
- Verify the states of Nova services from the output of the
openstack compute service list command (see the example at the end of
this alert).
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
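For example, the service states reported by Nova API can be checked as follows;
the grep filter is only a convenience, and the authoritative information is the
State column of the full listing:
# List all Nova controller and compute services together with their state
openstack compute service list
# Show only the services that are currently reported as down
openstack compute service list | grep -i down
# On a mon node, verify the monitoring_remote_agent service that collects the data
docker service ls --filter name=monitoring_remote_agent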
NovaServicesDownMinor
Severity |
Minor |
Summary |
{{ $value }} {{ $labels.binary }} services (>=0.3 * 100%) are
down. |
Raise condition |
count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by
(binary) >= on (binary)
count(openstack_nova_service_state{binary!~"nova-compute"}) by
(binary) * 0.3 |
Description |
Raises when more than 30% of Nova controller services of the same type
are in the DOWN state, according to the data from Nova API. For
details, see
Compute services
and Compute service overview. |
Troubleshooting |
- Inspect the NovaServiceDown alert for the nodes and services that
are in the DOWN state.
- Verify the states of Nova services from the output of the
openstack compute service list command.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaComputeServicesDownMinor
Severity |
Minor |
Summary |
{{ $value }} nova-compute services (>= 0.25 * 100%) are down. |
Raise condition |
count(openstack_nova_service_state{binary="nova-compute"} == 0) >=
count(openstack_nova_service_state{binary="nova-compute"}) * 0.25 |
Description |
Raises when more than 25% of Nova compute services are in the DOWN
state, according to the data from Nova API. For details,
see Compute services
and Compute service overview. |
Troubleshooting |
- Inspect the NovaServiceDown alert for the nodes and services that
are in the DOWN state.
- Verify the states of Nova services from the output of the
openstack compute service list command.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaServicesDownMajor
Severity |
Major |
Summary |
{{ $value }} {{ $labels.binary }} services (>= 0.6 * 100%) are
down. |
Raise condition |
count(openstack_nova_service_state{binary!~"nova-compute"} == 0) by
(binary) >= on (binary)
count(openstack_nova_service_state{binary!~"nova-compute"}) by
(binary) * 0.6 |
Description |
Raises when more than 60% of Nova controller services of the same type
are in the DOWN state, according to the data from Nova API. For
details, see
Compute services
and Compute service overview. |
Troubleshooting |
- Inspect the NovaServiceDown alert for the nodes and services that
are in the DOWN state.
- Verify the states of Nova services from the output of the
openstack compute service list command.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaComputeServicesDownMajor
Severity |
Major |
Summary |
{{ $value }} nova-compute services (>= 0.5 * 100%) are down. |
Raise condition |
count(openstack_nova_service_state{binary="nova-compute"} == 0) >=
count(openstack_nova_service_state{binary="nova-compute"}) * 0.5 |
Description |
Raises when more than 50% of Nova compute services are in the DOWN
state, according to the data from Nova API. For details,
see Compute services
and Compute service overview. |
Troubleshooting |
- Inspect the NovaServiceDown alert for the nodes and services that
are in the DOWN state.
- Verify the states of Nova services from the output of the
openstack compute service list command.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaServiceOutage
Severity |
Critical |
Summary |
All {{ $labels.binary }} services are down. |
Raise condition |
count(openstack_nova_service_state == 0) by (binary) == on (binary)
count(openstack_nova_service_state) by (binary) |
Description |
Raises when Nova controller or compute services of the same type are in
the DOWN state, according to the data from Nova API. For details,
see Compute services
and Compute service overview.
The binary and hostname labels in the raised alert contain the
service name that is in the DOWN state and the affected node name. |
Troubleshooting |
- Verify the states of Nova services from the output of the
openstack compute service list command.
- Verify the status of the monitoring_remote_agent service by running
docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NovaErrorLogsTooHigh
Severity |
Warning |
Summary |
The average per-second rate of errors in the Nova logs on the
{{ $labels.host }} node is more than 0.2 messages per second (as
measured over the last 5 minutes). |
Raise condition |
sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency|fatal))"}[5m]))
without (level) > 0.2 |
Description |
Raises when the average per-second rate of the error , fatal , or
emergency messages in Nova logs on the node is more than 0.2 per
second. Fluentd forwards all logs from Nova to Elasticsearch and counts
the number of log messages per severity. The host label in the
raised alert contains the name of the affected node. |
Troubleshooting |
Inspect the log files in the /var/log/nova/ directory of the
affected node. |
Tuning |
Typically, you should not change the default value. If the alert is
constantly firing, inspect the Nova error logs in the Kibana web UI.
However, you can adjust the threshold to an acceptable error rate for a
particular environment. In the Prometheus web UI, use the raise
condition query to view the rate of a particular message type in the
logs over a longer period of time and define the best threshold (see
the query example at the end of this alert).
For example, to change the threshold to 0.4:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml:
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        NovaErrorLogsTooHigh:
          if: >-
            sum(rate(log_messages{service="nova",
            level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
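To estimate a suitable threshold, the same expression can also be evaluated over
a longer range through the Prometheus HTTP API, for example for the last hour;
the Prometheus address and port below are placeholders for the values used in a
particular environment:
curl -G 'http://<prometheus_address>:<port>/api/v1/query' \
  --data-urlencode 'query=sum(rate(log_messages{service="nova",level=~"(?i:(error|emergency|fatal))"}[1h])) without (level)'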
NovaComputeSystemLoadTooHighWarning
Available starting from the 2019.2.9 maintenance update
Severity |
Warning |
Summary |
The system load per CPU on the {{ $labels.host }} node is more than
1 for 5 minutes. |
Raise condition |
system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.0 |
Description |
Raises when the average load on an OpenStack compute node is higher than
1 per CPU core over the last 5 minutes, indicating that the system
is overloaded, many processes are waiting for CPU time. The host
label in the raised alert contains the name of the affected node. |
Troubleshooting |
Inspect the output of the uptime and top commands on the
affected node (see the example at the end of this alert). |
Tuning |
For example, to change the threshold to 1.5 per core:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml:
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        NovaComputeSystemLoadTooHighWarning:
          if: >-
            system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 1.5
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
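The value that the alert evaluates is the 15-minute load average divided by the
number of CPU cores. For example, a hypothetical compute node with 16 cores and
a load15 of 20 yields 20 / 16 = 1.25, which exceeds the default threshold of 1.
Both inputs can be checked directly on the affected node:
# The three trailing numbers are the 1-, 5-, and 15-minute load averages
uptime
# Number of CPU cores; load15 divided by this value is compared against the threshold
nproc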
NovaComputeSystemLoadTooHighCritical
Available starting from the 2019.2.9 maintenance update
Severity |
Critical |
Summary |
The system load per CPU on the {{ $labels.host }} node is more than
2 for 5 minutes. |
Raise condition |
system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 2.0 |
Description |
Raises when the average load on an OpenStack compute node is higher than
2 per CPU over the last 5 minutes, indicating that the system is
overloaded, many processes are waiting for CPU time. The host label
in the raised alert contains the name of the affected node. |
Troubleshooting |
Inspect the output of the uptime and top commands on the
affected node. |
Tuning |
For example, to change the threshold to 3 per core:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml:
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        NovaComputeSystemLoadTooHighCritical:
          if: >-
            system_load15{host=~".*cmp[0-9]+"} / system_n_cpus > 3
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|