Neutron

This section describes the alerts for Neutron.


NeutronApiOutage

Removed since the 2019.2.11 maintenance update

Severity

Critical

Summary

Neutron API is not accessible for the Neutron endpoint in the OpenStack service catalog.

Raise condition

openstack_api_check_status{name="neutron"} == 0

Description

Raises when the checks against all available internal Neutron endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response code for Neutron is 200. For a list of all available endpoints, run openstack endpoint list.

Troubleshooting

Verify the availability of internal Neutron endpoints (URLs) from the output of openstack endpoint list.
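
The check described above can be approximated by hand. The following is a minimal sketch, assuming that admin credentials are already loaded in the shell and that curl is installed. The function names http_code and check_neutron_endpoints are illustrative and are not part of StackLight.

```shell
# Print the HTTP status code returned by a URL ("000" if unreachable).
http_code() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" || true
}

# Print "<code> <url>" for every internal Neutron endpoint in the service
# catalog. Return non-zero if any endpoint answers with a code other
# than 200, which is the expected code for Neutron.
check_neutron_endpoints() {
    rc=0
    urls=$(openstack endpoint list --service neutron --interface internal \
               -f value -c URL)
    for url in $urls; do
        code=$(http_code "$url")
        echo "$code $url"
        [ "$code" = "200" ] || rc=1
    done
    return "$rc"
}
```

Run check_neutron_endpoints from a node that can reach the internal network. Any line that does not start with 200 points to an endpoint worth inspecting.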

Tuning

Not required

NeutronApiEndpointDown

Severity

Minor

Summary

The neutron-api endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name="neutron-api"} == 0

Description

Raises when the check against a Neutron API endpoint does not pass, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.
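
The curl verification can be sketched as a small helper that compares the actual and expected response codes, in the same way that the Telegraf check does. This is an illustration only. The probe_url name is hypothetical, and the URL and expected code must be taken from /etc/telegraf/telegraf.d/input-http_response.conf on the affected node.

```shell
# probe_url URL [EXPECTED_CODE]
# Request URL and compare the returned HTTP status code with
# EXPECTED_CODE (200 if omitted). curl reports "000" when no HTTP
# response was received at all.
probe_url() {
    url=$1
    expected=${2:-200}
    actual=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url" || true)
    if [ "$actual" = "$expected" ]; then
        echo "OK $url returned $actual"
    else
        echo "FAIL $url returned $actual, expected $expected"
        return 1
    fi
}
```

For example, probe_url http://<node_address>:9696/ 200, where the address and port are placeholders. A result of 000 suggests a connectivity problem rather than an application error. To narrow the Telegraf logs to the time of the failure, journalctl -u telegraf --since "15 minutes ago" --no-pager can be used.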

Tuning

Not required

NeutronApiEndpointsDownMajor

Severity

Major

Summary

More than 50% of neutron-api endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name="neutron-api"} == 0) >= count (http_response_status{name="neutron-api"}) * 0.5

Description

Raises when the check against a Neutron API endpoint does not pass on at least 50% of OpenStack controller nodes, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Because the raise condition uses >=, exactly half of the nodes failing is enough to raise the alert. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. To identify the affected nodes, see the host label in the NeutronApiEndpointDown alerts.

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required

NeutronApiEndpointsOutage

Severity

Critical

Summary

All available neutron-api endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name="neutron-api"} == 0) == count (http_response_status{name="neutron-api"})

Description

Raises when the check against a Neutron API endpoint fails on every OpenStack controller node, typically indicating that the service endpoint is down or unreachable due to connectivity issues. Telegraf sends a request to the URL configured in /etc/telegraf/telegraf.d/input-http_response.conf on the corresponding node and compares the expected and actual HTTP response codes from the configuration file. To review the affected nodes, see the host label in the NeutronApiEndpointDown alerts.

Troubleshooting

  • Inspect the Telegraf logs using journalctl -u telegraf or in /var/log/telegraf.

  • Verify the configured URL availability using curl.

Tuning

Not required
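
The two aggregate endpoint alerts above differ only in how the number of failed checks compares with the total. The following sketch reproduces that arithmetic for given counts. The endpoints_alert helper is hypothetical and only mirrors the raise conditions shown above. It does not query Prometheus.

```shell
# endpoints_alert DOWN TOTAL
# Print the most severe aggregate alert whose raise condition is met when
# DOWN out of TOTAL neutron-api checks report 0.
endpoints_alert() {
    down=$1
    total=$2
    if [ "$down" -gt 0 ] && [ "$down" -eq "$total" ]; then
        # count(... == 0) == count(...)
        echo "NeutronApiEndpointsOutage"
    elif [ "$down" -gt 0 ] && [ $((down * 2)) -ge "$total" ]; then
        # count(... == 0) >= count(...) * 0.5
        echo "NeutronApiEndpointsDownMajor"
    else
        echo "none"
    fi
}
```

With three controller nodes, endpoints_alert 1 3 prints none and endpoints_alert 2 3 prints NeutronApiEndpointsDownMajor. With two nodes, one failed check already satisfies the >= comparison.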

NeutronAgentDown

Severity

Minor

Summary

The {{ $labels.binary }} agent on the {{ $labels.hostname }} node is down.

Raise condition

openstack_neutron_agent_state == 0

Description

Raises when a Neutron agent is in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary and hostname labels contain the name of the agent that is in the DOWN state and the node that hosts the agent.

Troubleshooting

  • Verify the statuses of Neutron agents using openstack network agent list.

  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on a mon node.
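
The docker service ls check can be narrowed to the one service of interest. The sketch below assumes that the command runs on a mon node that is a Docker Swarm manager. The remote_agent_healthy name is illustrative.

```shell
# remote_agent_healthy
# Print the replica count of monitoring_remote_agent and succeed only when
# every desired replica is running.
remote_agent_healthy() {
    replicas=$(docker service ls --filter name=monitoring_remote_agent \
                   --format '{{.Replicas}}' | head -n 1)
    # Some Docker versions append a note such as "(max 1 per node)".
    replicas=${replicas%% *}
    running=${replicas%%/*}
    desired=${replicas##*/}
    echo "monitoring_remote_agent replicas: ${replicas:-unknown}"
    [ -n "$replicas" ] && [ "$running" = "$desired" ] && [ "$running" != "0" ]
}
```

If the function fails, the Neutron agents themselves may be healthy and the alert may instead reflect a monitoring problem. In that case, docker service logs --since 15m monitoring_remote_agent limits the log output to recent entries.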

Tuning

Not required

NeutronAgentsDownMinor

Severity

Minor

Summary

More than 30% of {{ $labels.binary }} agents are down.

Raise condition

count by(binary) (openstack_neutron_agent_state == 0) >= on(binary) count by(binary) (openstack_neutron_agent_state) * 0.3

Description

Raises when at least 30% of Neutron agents of the same type are in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent type that is in the DOWN state.

Troubleshooting

  • Verify the statuses of Neutron agents using openstack network agent list.

  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.

  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required

NeutronAgentsDownMajor

Severity

Major

Summary

More than 60% of {{ $labels.binary }} agents are down.

Raise condition

count by(binary) (openstack_neutron_agent_state == 0) >= on(binary) count by(binary) (openstack_neutron_agent_state) * 0.6

Description

Raises when at least 60% of Neutron agents of the same type are in the DOWN state, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent type that is in the DOWN state.

Troubleshooting

  • Verify the statuses of Neutron agents using openstack network agent list.

  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.

  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required

NeutronAgentsOutage

Severity

Critical

Summary

All {{ $labels.binary }} agents are down.

Raise condition

count by(binary) (openstack_neutron_agent_state == 0) == on(binary) count by(binary) (openstack_neutron_agent_state)

Description

Raises when all Neutron agents of the same type are in the DOWN state and unavailable, according to the information from the Neutron API. For the list of Neutron services, see Networking service overview. This alert can also indicate issues with the Telegraf monitoring_remote_agent service. The binary label contains the name of the agent that is in the DOWN state.

Troubleshooting

  • Verify the statuses of Neutron agents using openstack network agent list.

  • Inspect the NeutronAgentDown alert for the nodes and services that are in the DOWN state.

  • Verify the status of the monitoring_remote_agent by running docker service ls on a mon node.

  • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required
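
The aggregate agent alerts above group agents by the binary label and compare the number of agents in the DOWN state with the total for that binary. The following sketch applies the same grouping to sample data. The classify_agents helper and its input format, one "<binary> <state>" pair per line with 0 meaning DOWN, are assumptions made for illustration.

```shell
# classify_agents
# Read "<binary> <state>" lines on stdin (state 0 = DOWN) and print, per
# binary, the number of DOWN agents and the most severe matching alert.
classify_agents() {
    awk '
        { total[$1]++; if ($2 == 0) down[$1]++ }
        END {
            for (b in total) {
                d = down[b] + 0
                if (d == 0)                      level = "ok"
                else if (d == total[b])          level = "NeutronAgentsOutage"
                else if (d * 10 >= total[b] * 6) level = "NeutronAgentsDownMajor"
                else if (d * 10 >= total[b] * 3) level = "NeutronAgentsDownMinor"
                else                             level = "NeutronAgentDown"
                printf "%s %d/%d %s\n", b, d, total[b], level
            }
        }' | sort
}
```

For example, one of three neutron-l3-agent instances in the DOWN state yields NeutronAgentsDownMinor, while one of four stays below the 30% threshold and only the per-agent NeutronAgentDown alert applies.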

NeutronErrorLogsTooHigh

Severity

Warning

Summary

The average per-second rate of errors in Neutron logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).

Raise condition

sum without(level) (rate(log_messages{level=~"(?i:(error|emergency|fatal))",service="neutron"}[5m])) > 0.2

Description

Raises when the average per-second rate of the error, fatal, or emergency messages in Neutron logs on the node is more than 0.2 per second. Fluentd forwards all logs from Neutron to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
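
To put the threshold in perspective, a rate of 0.2 messages per second over the 5-minute window corresponds to roughly 60 matching messages in 300 seconds. The sketch below performs this comparison for a given message count. The error_rate_exceeds helper is hypothetical, and the result is only an approximation because the Prometheus rate() function extrapolates over the sampled range.

```shell
# error_rate_exceeds COUNT WINDOW_SECONDS THRESHOLD
# Succeed when COUNT messages within WINDOW_SECONDS average more than
# THRESHOLD messages per second.
error_rate_exceeds() {
    awk -v n="$1" -v w="$2" -v t="$3" 'BEGIN { exit !(n / w > t) }'
}
```

For example, error_rate_exceeds 61 300 0.2 succeeds, while error_rate_exceeds 60 300 0.2 does not, because the raise condition requires the rate to be strictly greater than the threshold.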

Troubleshooting

Inspect the Neutron logs in the /var/log/neutron/ directory on the affected node.
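
To find the noisiest log file quickly, count the matching lines per file. This is a rough sketch. The count_error_lines name is illustrative, and the pattern matches the words error, emergency, or fatal anywhere in a line, which may differ from how Fluentd extracts the severity.

```shell
# count_error_lines FILE...
# Print "<count> <file>" for each file, counting the lines that contain
# the word error, emergency, or fatal in any letter case.
count_error_lines() {
    for f in "$@"; do
        n=$(grep -ciwE 'error|emergency|fatal' "$f" || true)
        echo "$n $f"
    done
}
```

For example, count_error_lines /var/log/neutron/*.log | sort -rn lists the files with the most matching lines first.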

Tuning

Typically, you should not change the default value. If the alert fires constantly, first inspect the Neutron error logs in the Kibana web UI to find the cause. If the observed error rate is acceptable for a particular environment, you can adjust the threshold. To choose a suitable value, run the raise condition query in the Prometheus web UI over a longer period of time and observe how often messages of each type appear in the logs.

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NeutronErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="neutron",
                level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.