OpenContrail

This section describes the general alerts for OpenContrail, including the API, process, instance, and health check alerts.


ContrailApiDown

Severity

Minor

Summary

The {{ $labels.name }} API endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name=~"contrail.*"} == 0

Description

Raises when the HTTP check for the OpenContrail API endpoint is failing. The host and name labels in the raised alert contain the host name of the affected node and the service name.

Troubleshooting

  • Before debugging OpenContrail, inspect the Neutron API, Keystone, and Neutron server alerts, if any.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.
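
The commands from the list above can be collected into a small helper so that they are run consistently for each affected service. The following is a minimal sketch only: the build_diagnostic_commands function and the example service name, PID, and port are hypothetical, and nothing is executed.

```python
import shlex


def build_diagnostic_commands(service_name, pid, port):
    """Return the troubleshooting commands as argv lists.

    service_name, pid, and port are placeholders supplied by the operator.
    The function only builds the command lines; it does not run them.
    """
    return [
        ["systemctl", "status", service_name],
        ["strace", "-p", str(pid), "-e", "trace=network"],
        ["lsof", "-p", str(pid)],
        ["tcpdump", "-nei", "any", "port", str(port), "-A", "-s", "1500"],
    ]


# Hypothetical values, for illustration only.
for argv in build_diagnostic_commands("contrail-api", 1234, 8082):
    print(shlex.join(argv))
```

Printing the commands before running them makes it easier to review the PID and port that were substituted.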

Tuning

Not required

ContrailApiDownMinor

Severity

Minor

Summary

More than 30% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.3

Description

Raises when at least 30% of the HTTP checks for an OpenContrail API service fail. The name label in the raised alert contains the affected service name.
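
The raise condition compares, per service name, the number of failed checks with a fraction of all checks. The following sketch reproduces that comparison outside of Prometheus. The endpoints_over_threshold function and the sample host and service names are hypothetical; the only assumption taken from the rule is that a status of 0 means a failed check.

```python
from collections import defaultdict


def endpoints_over_threshold(samples, ratio):
    """Return service names whose share of failed checks is >= ratio.

    samples: iterable of (name, host, status) tuples, where status == 0
    means the HTTP check failed, as in the raise condition.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for name, _host, status in samples:
        total[name] += 1
        if status == 0:
            down[name] += 1
    # Names without any failed check have no entry in `down`, just as the
    # left-hand count() in the rule returns no series for them.
    return sorted(name for name, n in down.items() if n >= ratio * total[name])


# Hypothetical samples: one of three contrail-api endpoints is failing.
samples = [
    ("contrail-api", "ntw01", 0),
    ("contrail-api", "ntw02", 1),
    ("contrail-api", "ntw03", 1),
    ("contrail-discovery", "ntw01", 1),
]
print(endpoints_over_threshold(samples, 0.3))
print(endpoints_over_threshold(samples, 0.6))
```

With three endpoints, a single failure already satisfies the 0.3 ratio, while the 0.6 ratio requires two failures.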

Troubleshooting

  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailApiDownMajor

Severity

Major

Summary

More than 60% of {{ $labels.name }} API endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.6

Description

Raises when at least 60% of the HTTP checks for an OpenContrail API service fail. The name label in the raised alert contains the affected service name.

Troubleshooting

  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailApiOutage

Severity

Critical

Summary

The {{ $labels.name }} API is not accessible for all available endpoints for 2 minutes.

Raise condition

count(http_response_status{name=~"contrail.*"} == 0) by (name) == count(http_response_status{name=~"contrail.*"}) by (name)

Description

Raises when the HTTP checks fail for all OpenContrail API endpoints. The name label in the raised alert contains the affected service name.

Troubleshooting

  • Inspect the ContrailApiDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailProcessDown

Severity

Minor

Summary

The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name=~"contrail.*"} == 0

Description

Raises when the OpenContrail process is down on one node. The host and process_name labels in the raised alert contain the host name of the affected node and the process name.

Troubleshooting

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailProcessDownMinor

Severity

Minor

Summary

More than 30% of {{ $labels.process_name }} processes are down.

Raise condition

count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.3 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)

Description

Raises when at least 30% of the OpenContrail processes with the same name are in the DOWN state. The process_name label in the raised alert contains the affected process name.

Troubleshooting

  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailProcessDownMajor

Severity

Major

Summary

More than 60% of {{ $labels.process_name }} processes are down.

Raise condition

count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.6 * count (procstat_running{process_name=~"contrail.*"}) by (process_name)

Description

Raises when at least 60% of the OpenContrail processes with the same name are in the DOWN state. The process_name label in the raised alert contains the affected process name.

Troubleshooting

  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailProcessOutage

Severity

Critical

Summary

All {{ $labels.process_name }} processes are down.

Raise condition

count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) == count(procstat_running{process_name=~"contrail.*"}) by (process_name)

Description

Raises when an OpenContrail process is in the DOWN state on all nodes, indicating that the process is unavailable. The process_name label in the raised alert contains the affected process name.

Troubleshooting

  • Inspect the ContrailProcessDown alerts for the host names of the affected nodes.

  • Verify the service status using systemctl status <service_name> and the service logs in /var/log/contrail/.

  • If the process is still running, obtain the details about the service state:

    • Trace the system calls using strace -p <pid> -e trace=network.

    • List the open files, including the network sockets and devices, using lsof -p <pid>.

    • Analyze the packets sent to the port used by the service using tcpdump -nei any port <portnum> -A -s 1500.

Tuning

Not required

ContrailMetadataCheck

Available starting from the 2019.2.3 maintenance update

Severity

Critical

Summary

The OpenContrail metadata on the {{ $labels.host }} node is unavailable for 15 minutes.

Raise condition

min(exec_contrail_instance_metadata_present) by (host) == 0

Description

Raises when the curl 127.0.0.1:8085/Snh_LinkLocalServiceInfo | grep -c 169.254.169.254 HTTP query returns 0, indicating that the OpenContrail metadata service is unavailable on the affected node.
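
The query above can be reproduced manually to confirm what the check sees. The following is a minimal sketch, not the script that StackLight runs: the function names and the sample response are hypothetical, and only the URL and the searched address come from the description.

```python
import urllib.request

METADATA_IP = "169.254.169.254"
INTROSPECT_URL = "http://127.0.0.1:8085/Snh_LinkLocalServiceInfo"


def fetch_link_local_info(url=INTROSPECT_URL, timeout=5):
    """Fetch the link-local service information as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def count_matching_lines(text, needle=METADATA_IP):
    """Count the lines that contain needle, similar to `grep -c`.

    grep treats '.' as a wildcard, so this literal match is slightly
    stricter than the original command.
    """
    return sum(1 for line in text.splitlines() if needle in line)


# Hypothetical response body, used here instead of a live query.
sample = "<linklocal>\n<ip>169.254.169.254</ip>\n</linklocal>\n"
print(count_matching_lines(sample))
```

On a node, pass the result of fetch_link_local_info() to count_matching_lines(); a result of 0 corresponds to the state that raises the alert.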

Troubleshooting

  1. Log in to the OpenContrail web UI using the credentials from /etc/contrail/contrail-webui-userauth.js on the network nodes.

  2. Navigate to Configure > Infrastructure > Link local services and verify that metadata is configured.

  3. Inspect the Telegraf logs using journalctl -u telegraf.

Tuning

Not required

ContrailHealthCheckDisabled

Available starting from the 2019.2.4 maintenance update

Severity

Critical

Summary

The OpenContrail health check is disabled.

Raise condition

absent(contrail_health_exit_code) == 1

Description

Raises when the metric from the contrail-status check script is absent.
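
The absent() function returns a single sample with the value 1 only when no series of the metric exist, which is why the rule compares the result with 1. The following sketch mimics this behavior on plain lists. Both functions are hypothetical illustrations and not part of any Prometheus client library.

```python
def absent(series):
    """Mimic PromQL absent(): [1] if no series exist, otherwise []."""
    return [] if series else [1]


def health_check_disabled(series):
    """Return True when the rule absent(metric) == 1 would match."""
    return any(value == 1 for value in absent(series))


print(health_check_disabled([]))      # no contrail_health_exit_code series
print(health_check_disabled([0, 2]))  # series exist, whatever their values
```

The values of the existing series do not matter for this alert; a non-zero exit code is covered by the ContrailHealthCheckFailed alert.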

Troubleshooting

Inspect the Telegraf logs on the ntw nodes.

Tuning

Not required

ContrailHealthCheckFailed

Available starting from the 2019.2.4 maintenance update

Severity

Critical

Summary

The OpenContrail health check failed for the {{ $labels.contrail_service }} service on the {{ $labels.host }} node.

Raise condition

contrail_health_exit_code != 0

Description

Raises when any OpenContrail service in the contrail-status output is inactive. The contrail_service and host labels in the raised alert contain the affected service name and the host name of the affected node.

Troubleshooting

Inspect the affected service.

Tuning

Not required

OpencontrailInstancePingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity

Major

Summary

The OpenContrail instance ping check on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

instance_ping_check_up == 0

Description

Raises when the OpenContrail instance ping check on a node is down for 2 minutes. The host label in the raised alert contains the affected node name.
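
The 2-minute window means that a single failed sample does not raise the alert. The following sketch illustrates this, assuming that the window behaves like a Prometheus for clause, where the condition must hold at every evaluation for the whole duration. The first_firing_time function and the sample timestamps are hypothetical.

```python
def first_firing_time(samples, hold_seconds=120):
    """Return the timestamp at which the alert would start firing, or None.

    samples: list of (timestamp_seconds, value) in ascending time order,
    where value == 0 means that the ping check is down.
    """
    down_since = None
    for ts, value in samples:
        if value == 0:
            if down_since is None:
                down_since = ts  # the condition starts to hold
            if ts - down_since >= hold_seconds:
                return ts
        else:
            down_since = None  # a recovery resets the pending timer
    return None


# Hypothetical samples taken every 30 seconds.
samples = [(0, 1), (30, 0), (60, 0), (90, 1), (120, 0),
           (150, 0), (180, 0), (210, 0), (240, 0)]
print(first_firing_time(samples))
```

In this example, the short outage between 30 and 60 seconds is ignored because the check recovers at 90 seconds, and the timer restarts at 120 seconds.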

Tuning

Not required