OpenContrail

This section describes the general alerts for OpenContrail, such as the API, processes, instance, and health check alerts.


ContrailApiDown

Severity Minor
Summary The {{ $labels.name }} API endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"contrail.*"} == 0
Description Raises when the HTTP check for the OpenContrail API endpoint is failing. The host and name labels in the raised alert contain the host name of the affected node and the service name.
Troubleshooting
  • Before debugging OpenContrail, inspect the Neutron API, Keystone, and Neutron server alerts, if any.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
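
To see which nodes and services currently match the raise condition, the expression can be evaluated through the Prometheus HTTP API (/api/v1/query). The following is a minimal sketch, not part of the product tooling. PROMETHEUS_URL, the helper names, and the host and service names in SAMPLE are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the Prometheus server; adjust for your deployment.
PROMETHEUS_URL = "http://localhost:9090"
# Raise condition of ContrailApiDown, copied from the alert definition.
QUERY = 'http_response_status{name=~"contrail.*"} == 0'


def fetch_instant_query(base_url, query):
    """Run an instant query through the Prometheus HTTP API and return the JSON."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def failing_endpoints(payload):
    """Return sorted (host, name) pairs for every series in an instant-query result."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    pairs = set()
    for series in payload["data"]["result"]:
        labels = series["metric"]
        pairs.add((labels.get("host", "?"), labels.get("name", "?")))
    return sorted(pairs)


# Hand-written sample response; the host and service names are made up.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"host": "ntw02", "name": "contrail-api"},
             "value": [1700000000, "0"]},
            {"metric": {"host": "ntw01", "name": "contrail-api"},
             "value": [1700000000, "0"]},
        ],
    },
}

for host, name in failing_endpoints(SAMPLE):
    print(f"{name} is failing on {host}")
```

Against a live server, pass the result of fetch_instant_query(PROMETHEUS_URL, QUERY) to failing_endpoints instead of SAMPLE.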

ContrailApiDownMinor

Severity Minor
Summary More than 30% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.3
Description Raises when at least 30% of the HTTP checks for an OpenContrail API service fail. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required

ContrailApiDownMajor

Severity Major
Summary More than 60% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.6
Description Raises when at least 60% of the HTTP checks for an OpenContrail API service fail. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required

ContrailApiOutage

Severity Critical
Summary The {{ $labels.name }} API is not accessible for all available endpoints for 2 minutes.
Raise condition count(http_response_status{name=~"contrail.*"} == 0) by (name) == count(http_response_status{name=~"contrail.*"}) by (name)
Description Raises when the HTTP checks fail for all OpenContrail API endpoints. The name label in the raised alert contains the affected service name.
Troubleshooting
  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
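
The ContrailApiDownMinor, ContrailApiDownMajor, and ContrailApiOutage alerts differ only in the share of failing endpoints per service name, and their conditions overlap, so a full outage satisfies all three. The following minimal sketch mirrors the count(...) by (name) comparisons in plain Python. The function name, the sample tuples, and the three-endpoint service are assumptions for illustration. The sketch does not model the 2-minute hold period.

```python
from collections import defaultdict


def aggregate_api_alerts(samples):
    """Map each service name to the aggregate alerts its failure share would raise.

    `samples` is an iterable of (name, host, status) tuples, where status 0
    means that the HTTP check failed, mirroring http_response_status.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for name, _host, status in samples:
        total[name] += 1
        if status == 0:
            failed[name] += 1

    alerts = {}
    # Only names with at least one failed check appear on the left-hand side
    # of the PromQL comparison, so names without failures raise nothing.
    for name, down in failed.items():
        firing = []
        if down >= total[name] * 0.3:
            firing.append("ContrailApiDownMinor")
        if down >= total[name] * 0.6:
            firing.append("ContrailApiDownMajor")
        if down == total[name]:
            firing.append("ContrailApiOutage")
        if firing:
            alerts[name] = firing
    return alerts


def three_endpoints(failed_count):
    """Build samples for a hypothetical service with three endpoints."""
    return [("contrail-api", f"ntw0{i + 1}", 0 if i < failed_count else 1)
            for i in range(3)]


for failed_count in range(4):
    print(failed_count, aggregate_api_alerts(three_endpoints(failed_count)))
```

With three endpoints, one failure (about 33%) satisfies only the Minor condition, two failures (about 67%) satisfy Minor and Major, and three failures satisfy all three conditions.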

ContrailProcessDown

Severity Minor
Summary The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name=~"contrail.*"} == 0
Description Raises when an OpenContrail process is down on a node. The host and process_name labels in the raised alert contain the host name of the affected node and the process name.
Troubleshooting
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required
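
The troubleshooting commands above take a service name, a process ID, and a port as placeholders. As a convenience, the following minimal sketch fills in those placeholders and prints the resulting command lines. The helper name and the example values (contrail-api, 1234, 8082) are assumptions for illustration, and the sketch only builds the commands without running them.

```python
import shlex


def diagnostic_commands(service_name, pid, port):
    """Return the troubleshooting commands as argument lists."""
    return [
        ["systemctl", "status", service_name],
        ["strace", "-p", str(pid), "-e", "trace=network"],
        ["lsof", "-p", str(pid)],
        ["tcpdump", "-nei", "any", "port", str(port), "-A", "-s", "1500"],
    ]


# Placeholder values; substitute the real service name, PID, and port.
for argv in diagnostic_commands("contrail-api", 1234, 8082):
    print(shlex.join(argv))
```

Each argument list can be passed to subprocess.run as is, typically with root privileges for strace, lsof, and tcpdump.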

ContrailProcessDownMinor

Severity Minor
Summary More than 30% of {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.3 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when at least 30% of the instances of an OpenContrail process (by name) are in the DOWN state. The process_name label in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required

ContrailProcessDownMajor

Severity Major
Summary More than 60% of {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.6 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when at least 60% of the instances of an OpenContrail process (by name) are in the DOWN state. The process_name label in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required

ContrailProcessOutage

Severity Critical
Summary All {{ $labels.process_name }} processes are down.
Raise condition count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) == count(procstat_running{process_name=~"contrail.*"}) by (process_name)
Description Raises when an OpenContrail process is in the DOWN state on all nodes, indicating that the process is unavailable. The process_name label in the raised alert contains the affected process name.
Troubleshooting
  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.
  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.
  • If the process is still running, obtain the details about the service state:
    • Trace the system calls using strace -p <pid> -e trace=network.
    • List the open files, including the network sockets and devices using lsof -p <pid>.
    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
Tuning Not required

ContrailMetadataCheck

Available starting from the 2019.2.3 maintenance update

Severity Critical
Summary The OpenContrail metadata on the {{ $labels.host }} node is unavailable for 15 minutes.
Raise condition min(exec_contrail_instance_metadata_present) by (host) == 0
Description Raises when the curl 127.0.0.1:8085/Snh_LinkLocalServiceInfo | grep -c 169.254.169.254 HTTP query returns 0, indicating that the OpenContrail metadata service is unavailable on the affected node.
Troubleshooting
  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.
  2. Navigate to Configure > Infrastructure > Link local services and verify that metadata is configured.
  3. Inspect the Telegraf logs using journalctl -u telegraf.
Tuning Not required
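
The check behind this alert can be reproduced by hand. The following minimal sketch mimics the curl and grep -c pipeline from the description: it counts the lines of the introspection response that contain the metadata address. The function names are hypothetical, and how the line count maps to the exec_contrail_instance_metadata_present metric (1 when at least one line matches, otherwise 0) is an assumption rather than something stated in the alert definition.

```python
import urllib.request

# URL and address taken from the alert description.
INTROSPECT_URL = "http://127.0.0.1:8085/Snh_LinkLocalServiceInfo"
METADATA_IP = "169.254.169.254"


def count_matching_lines(text, needle=METADATA_IP):
    """Count the lines that contain `needle`, similar to `grep -c`.

    Unlike grep, this is a plain substring match, not a regular expression.
    """
    return sum(1 for line in text.splitlines() if needle in line)


def metadata_present(url=INTROSPECT_URL):
    """Return 1 if the response mentions the metadata address, otherwise 0."""
    with urllib.request.urlopen(url, timeout=5) as response:
        body = response.read().decode("utf-8", errors="replace")
    return 1 if count_matching_lines(body) > 0 else 0


# Offline illustration with made-up response fragments.
print(count_matching_lines("<ip>169.254.169.254</ip>\n<port>80</port>"))
print(count_matching_lines("<list/>"))
```

Run metadata_present() on the affected node itself, because the query targets 127.0.0.1.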

ContrailHealthCheckDisabled

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail health check is disabled.
Raise condition absent(contrail_health_exit_code) == 1
Description Raises when the metric from the contrail-status check script is absent.
Troubleshooting Inspect the Telegraf logs on the ntw (OpenContrail controller) nodes using journalctl -u telegraf.
Tuning Not required

ContrailHealthCheckFailed

Available starting from the 2019.2.4 maintenance update

Severity Critical
Summary The OpenContrail health check failed for the {{ $labels.contrail_service }} service on the {{ $labels.host }} node.
Raise condition contrail_health_exit_code != 0
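Description Raises when any OpenContrail service in the output of contrail-status is inactive. The contrail_service and host labels in the raised alert contain the affected service name and the host name of the affected node.
Troubleshooting Inspect the affected service using contrail-status on the affected node and the service logs in /var/log/contrail/.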
Tuning Not required
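
The two health check alerts cover complementary cases: ContrailHealthCheckDisabled fires when no contrail_health_exit_code series exists at all, while ContrailHealthCheckFailed fires once per series with a non-zero value. The following minimal sketch models that split. The function name, the tuple layout, and the sample host and service names are assumptions for illustration.

```python
def health_alerts(samples):
    """Return the health check alerts implied by the current samples.

    `samples` is a list of (host, contrail_service, exit_code) tuples, one per
    contrail_health_exit_code series. An empty list models an absent metric.
    """
    if not samples:
        # absent(contrail_health_exit_code) == 1
        return [("ContrailHealthCheckDisabled", None, None)]
    # contrail_health_exit_code != 0
    return [("ContrailHealthCheckFailed", host, service)
            for host, service, exit_code in samples
            if exit_code != 0]


# Made-up samples: one healthy and one failing service.
print(health_alerts([("ntw01", "contrail-api", 0),
                     ("ntw01", "contrail-schema", 1)]))
print(health_alerts([]))
```

The two alerts cannot fire at the same time in this model, because a failed check requires at least one series to be present.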

OpencontrailInstancePingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity Major
Summary The OpenContrail instance ping check on the {{ $labels.host }} node is down for 2 minutes.
Raise condition instance_ping_check_up == 0
Description Raises when the OpenContrail instance ping check on a node is down for 2 minutes. The host label in the raised alert contains the affected node name.
Tuning Not required
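
The "for 2 minutes" wording suggests that the condition must hold continuously before the alert is raised. The following is a simplified sketch of that hold period, assuming a 120-second duration and evaluation results supplied as (timestamp, value) pairs. The function name and the sample data are hypothetical, and details of the real rule evaluation, such as missing samples, are not modeled.

```python
def is_firing(evaluations, hold_seconds=120):
    """Tell whether the condition value == 0 has held for the whole hold period.

    `evaluations` is a chronologically ordered list of (timestamp_seconds, value)
    pairs, one per rule evaluation of instance_ping_check_up.
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value == 0:
            if pending_since is None:
                pending_since = timestamp  # condition became true
        else:
            pending_since = None  # any recovery resets the hold period
    if pending_since is None:
        return False
    return evaluations[-1][0] - pending_since >= hold_seconds


# Down continuously from t=30 to t=150: the 120-second hold period is met.
steady = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0), (150, 0)]
# A brief recovery at t=90 restarts the hold period.
flapping = [(0, 0), (60, 0), (90, 1), (120, 0), (180, 0)]

print(is_firing(steady), is_firing(flapping))
```

A check that briefly recovers restarts the hold period, so a flapping ping check may never raise the alert even though it is down most of the time.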