Neutron
This section describes the alerts for Neutron.
NeutronApiOutage
Removed since the 2019.2.11 maintenance update
Severity |
Critical |
Summary |
Neutron API is not accessible for the Neutron endpoint in the OpenStack
service catalog. |
Raise condition |
openstack_api_check_status{name="neutron"} == 0 |
Description |
Raises when the checks against all available internal Neutron endpoints
in the OpenStack service catalog do not pass. Telegraf sends HTTP
requests to the URLs from the OpenStack service catalog and compares the
expected and actual HTTP response codes. The expected response code for
Neutron is 200. For a list of all available endpoints, run
openstack endpoint list. |
Troubleshooting |
Verify the availability of internal Neutron endpoints (URLs) from the
output of openstack endpoint list. |
Tuning |
Not required |
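The comparison of expected and actual response codes described above can be illustrated with a short Python sketch. This is not Telegraf code: the function names status_matches and probe and the 5-second timeout are assumptions made for illustration only. The sketch returns 0 for a failed check, which is the value the raise condition tests for.

```python
import urllib.error
import urllib.request


def status_matches(actual_code, expected_code=200):
    """Return 1 if the actual code equals the expected one, else 0."""
    return 1 if actual_code == expected_code else 0


def probe(url, expected_code=200, timeout=5.0):
    """Send a GET request to url and compare the response code.

    Hypothetical helper: any connection failure counts as a failed check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            actual = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error code (4xx or 5xx).
        actual = exc.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        return 0
    return status_matches(actual, expected_code)
```

To try it against a real deployment, pass one of the internal Neutron URLs from the output of openstack endpoint list to probe.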
NeutronApiEndpointDown
Severity |
Minor |
Summary |
The neutron-api endpoint on the {{ $labels.host }} node is not
accessible for 2 minutes. |
Raise condition |
http_response_status{name="neutron-api"} == 0 |
Description |
Raises when the check against a Neutron API endpoint does not pass,
typically indicating that the service endpoint is down or unreachable
due to connectivity issues. Telegraf sends a request to the URL
configured in /etc/telegraf/telegraf.d/input-http_response.conf on
the corresponding node and compares the expected and actual HTTP
response codes from the configuration file. The host label in the
raised alert contains the host name of the affected node. |
Troubleshooting |
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the configured URL availability using curl.
|
Tuning |
Not required |
NeutronApiEndpointsDownMajor
Severity |
Major |
Summary |
More than 50% of neutron-api endpoints are not accessible for 2
minutes. |
Raise condition |
count(http_response_status{name="neutron-api"} == 0) >=
count(http_response_status{name="neutron-api"}) * 0.5 |
Description |
Raises when the check against a Neutron API endpoint does not pass on
50% or more of OpenStack controller nodes, typically indicating that
the service endpoint is down or unreachable due to connectivity issues.
Telegraf sends a request to the URL configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. To identify the affected node,
see the host label in the NeutronApiEndpointDown alert. |
Troubleshooting |
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the configured URL availability using curl.
|
Tuning |
Not required |
NeutronApiEndpointsOutage
Severity |
Critical |
Summary |
All available neutron-api endpoints are not accessible for 2
minutes. |
Raise condition |
count(http_response_status{name="neutron-api"} == 0) ==
count(http_response_status{name="neutron-api"}) |
Description |
Raises when the check against a Neutron API endpoint does not pass on
all OpenStack controller nodes, typically indicating that the service
endpoint is down or unreachable due to connectivity issues. Telegraf
sends a request to the URL configured in
/etc/telegraf/telegraf.d/input-http_response.conf on the
corresponding node and compares the expected and actual HTTP response
codes from the configuration file. To identify the affected node,
see the host label in the NeutronApiEndpointDown alert. |
Troubleshooting |
- Inspect the Telegraf logs using journalctl -u telegraf or in
  /var/log/telegraf.
- Verify the configured URL availability using curl.
|
Tuning |
Not required |
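The raise conditions of NeutronApiEndpointsDownMajor and NeutronApiEndpointsOutage both compare a count of failed checks with the total number of checked endpoints. The following Python sketch reproduces that arithmetic on a plain list of 0/1 values so the thresholds can be checked by hand. The function names are hypothetical, and the sketch assumes one status value per controller node.

```python
def endpoints_down_major(statuses, ratio=0.5):
    """statuses: list of 0/1 values, one per endpoint (0 = check failed)."""
    down = sum(1 for s in statuses if s == 0)
    # With no failed checks the left-hand count() is empty, so nothing fires.
    if down == 0:
        return False
    # Same comparison as the raise condition: down >= total * 0.5.
    return down >= len(statuses) * ratio


def endpoints_outage(statuses):
    """True when every checked endpoint failed (and there is at least one)."""
    down = sum(1 for s in statuses if s == 0)
    return down > 0 and down == len(statuses)
```

Because the condition uses >=, a cluster with exactly half of its endpoints failing (for example, 2 of 4) already satisfies the Major condition in this sketch.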
NeutronAgentDown
Severity |
Minor |
Summary |
The {{ $labels.binary }} agent on the {{ $labels.hostname }}
node is down. |
Raise condition |
openstack_neutron_agent_state == 0 |
Description |
Raises when a Neutron agent is in the DOWN state, according to the
information from the Neutron API. For the list of Neutron services, see
Networking service overview.
This alert can also indicate issues with the Telegraf
monitoring_remote_agent service. The binary and hostname
labels contain the name of the agent that is in the DOWN state and
the node that hosts the agent. |
Troubleshooting |
- Verify the statuses of Neutron agents using
  openstack network agent list.
- Verify the status of the monitoring_remote_agent service by running
  docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
  docker service logs monitoring_remote_agent on a mon node.
|
Tuning |
Not required |
NeutronAgentsDownMinor
Severity |
Minor |
Summary |
More than 30% of {{ $labels.binary }} agents are down. |
Raise condition |
count by(binary) (openstack_neutron_agent_state == 0) >= on(binary)
count by(binary) (openstack_neutron_agent_state) * 0.3 |
Description |
Raises when 30% or more of Neutron agents of the same type are in the
DOWN state, according to the information from the Neutron API. For
the list of Neutron services, see Networking service overview.
This alert can also indicate issues with the Telegraf
monitoring_remote_agent service. The binary label contains the
name of the agent that is in the DOWN state. |
Troubleshooting |
- Verify the statuses of Neutron agents using
  openstack network agent list.
- Inspect the NeutronAgentDown alert for the nodes and services that
  are in the DOWN state.
- Verify the status of the monitoring_remote_agent service by running
  docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
  docker service logs monitoring_remote_agent on one of the mon nodes.
|
Tuning |
Not required |
NeutronAgentsDownMajor
Severity |
Major |
Summary |
More than 60% of {{ $labels.binary }} agents are down. |
Raise condition |
count by(binary) (openstack_neutron_agent_state == 0) >= on(binary)
count by(binary) (openstack_neutron_agent_state) * 0.6 |
Description |
Raises when 60% or more of Neutron agents of the same type are in the
DOWN state, according to the information from the Neutron API. For
the list of Neutron services, see Networking service overview.
This alert can also indicate issues with the Telegraf
monitoring_remote_agent service. The binary label contains the
name of the agent that is in the DOWN state. |
Troubleshooting |
- Verify the statuses of Neutron agents using
  openstack network agent list.
- Inspect the NeutronAgentDown alert for the nodes and services that
  are in the DOWN state.
- Verify the status of the monitoring_remote_agent service by running
  docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
  docker service logs monitoring_remote_agent on one of the mon nodes.
|
Tuning |
Not required |
NeutronAgentsOutage
Severity |
Critical |
Summary |
All {{ $labels.binary }} agents are down. |
Raise condition |
count by(binary) (openstack_neutron_agent_state == 0) == on(binary)
count by(binary) (openstack_neutron_agent_state) |
Description |
Raises when all Neutron agents of the same type are in the DOWN
state and unavailable, according to the information from the Neutron
API. For the list of Neutron services, see Networking service overview.
This alert can also indicate issues with the Telegraf
monitoring_remote_agent service. The binary label contains the
name of the agent that is in the DOWN state. |
Troubleshooting |
- Verify the statuses of Neutron agents using
  openstack network agent list.
- Inspect the NeutronAgentDown alert for the nodes and services that
  are in the DOWN state.
- Verify the status of the monitoring_remote_agent service by running
  docker service ls on a mon node.
- Inspect the monitoring_remote_agent service logs by running
  docker service logs monitoring_remote_agent on one of the mon nodes.
|
Tuning |
Not required |
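The three aggregate agent alerts (NeutronAgentsDownMinor, NeutronAgentsDownMajor, NeutronAgentsOutage) apply the same per-binary counting with different ratios. The Python sketch below mimics the count by(binary) grouping on a list of (binary, state) pairs. It is an illustration only: the function name agent_alerts and the sample agent names in the usage note are assumptions, and state 1 is assumed to mean that the agent is up.

```python
from collections import defaultdict


def agent_alerts(agents):
    """agents: iterable of (binary, state) pairs; state 0 means DOWN.

    Returns {binary: [alert names]} for every binary with an agent down.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for binary, state in agents:
        total[binary] += 1
        if state == 0:
            down[binary] += 1

    result = {}
    for binary, n_down in down.items():
        fired = []
        if n_down >= total[binary] * 0.3:
            fired.append("NeutronAgentsDownMinor")
        if n_down >= total[binary] * 0.6:
            fired.append("NeutronAgentsDownMajor")
        if n_down == total[binary]:
            fired.append("NeutronAgentsOutage")
        result[binary] = fired
    return result
```

For example, with three neutron-l3-agent instances of which one is down, only the Minor condition holds (1 >= 0.9 but 1 < 1.8). With two down, both Minor and Major hold.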
NeutronErrorLogsTooHigh
Severity |
Warning |
Summary |
The average per-second rate of errors in Neutron logs on the
{{ $labels.host }} node is {{ $value }} (as measured over the
last 5 minutes). |
Raise condition |
sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",
service="neutron"}[5m])) > 0.2 |
Description |
Raises when the average per-second rate of the error, fatal, or
emergency messages in Neutron logs on the node is more than 0.2 per
second. Fluentd forwards all logs from Neutron to Elasticsearch and
counts the number of log messages per severity. The host label in
the raised alert contains the host name of the affected node. |
Troubleshooting |
Inspect the Neutron logs in the /var/log/neutron/ directory on the
affected node. |
Tuning |
Typically, you should not change the default value. If the alert is
constantly firing, inspect the Neutron error logs in the Kibana web UI.
However, you can adjust the threshold to an acceptable error rate for a
particular environment. In the Prometheus web UI, use the raise
condition query to view the appearance rate of a particular message type
in logs for a longer period of time and define the best threshold.
For example, to change the threshold to 0.4:
1. On the cluster level of the Reclass model, create a common file for
   all alert customizations. Skip this step to use an existing defined
   file.
   a. Create a file for alert customizations:
        touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in
      cluster/<cluster_name>/stacklight/server.yml:
        classes:
        - cluster.<cluster_name>.stacklight.custom.alerts
        ...
2. In the defined alert customizations file, modify the alert by
   overriding the if parameter. Keep the level regular expression on a
   single line so that the folded YAML scalar does not insert a space
   into it:
        parameters:
          prometheus:
            server:
              alert:
                NeutronErrorLogsTooHigh:
                  if: >-
                    sum(rate(log_messages{service="neutron", level=~"(?i:(error|emergency|fatal))"}[5m]))
                    without (level) > 0.4
3. From the Salt Master node, apply the changes:
        salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
|
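When choosing a threshold for NeutronErrorLogsTooHigh, it can help to translate the per-second rate into a message count over the 5-minute window. The Python sketch below does that arithmetic. It is a simplification: the function names are hypothetical, and the Prometheus rate() function also extrapolates at the window edges, so the real value may differ slightly.

```python
def average_rate(message_count, window_seconds=300):
    """Average per-second rate of messages counted over the window."""
    return message_count / window_seconds


def exceeds_threshold(message_count, threshold=0.2, window_seconds=300):
    """True if the average rate is strictly above the threshold."""
    return average_rate(message_count, window_seconds) > threshold
```

Under these assumptions, the default threshold of 0.2 corresponds to more than 60 error-level messages in 5 minutes, and a threshold of 0.4 corresponds to more than 120.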