OpenContrail

OpenContrail¶

This section describes the general alerts for OpenContrail, such as the API, processes, instance, and health check alerts.

ContrailApiDown
ContrailApiDownMinor
ContrailApiDownMajor
ContrailApiOutage
ContrailProcessDown
ContrailProcessDownMinor
ContrailProcessDownMajor
ContrailProcessOutage
ContrailMetadataCheck
ContrailHealthCheckDisabled
ContrailHealthCheckFailed
OpencontrailInstancePingCheckDown

ContrailApiDown¶

Severity	Minor
Summary	The `{{ $labels.name }}` API endpoint on the `{{ $labels.host }}` node is not accessible for 2 minutes.
Raise condition	`http_response_status{name=~"contrail.*"} == 0`
Description	Raises when the HTTP check for the OpenContrail API endpoint is failing. The `host` and `name` labels in the raised alert contain the host name of the affected node and the service name.
Troubleshooting	Before debugging OpenContrail, inspect the Neutron API, Keystone, and Neutron server alerts, if any. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required
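
The troubleshooting commands above can be collected into a small helper, sketched here in Python. It only builds the command lines named in this section and returns them as strings; it does not run anything. The function name `troubleshooting_commands` and the sample values `contrail-api`, `1234`, and `8082` are illustrative assumptions, not values defined by the alert.

```python
import shlex


def troubleshooting_commands(service_name, pid, port):
    """Return the diagnostic command lines from this section as strings.

    Nothing is executed; the caller decides which commands to run.
    """
    commands = [
        ["systemctl", "status", service_name],
        ["strace", "-p", str(pid), "-e", "trace=network"],
        ["lsof", "-p", str(pid)],
        ["tcpdump", "-nei", "any", "port", str(port), "-A", "-s", "1500"],
    ]
    # shlex.join quotes each argument so the line is safe to paste into a shell.
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for line in troubleshooting_commands("contrail-api", 1234, 8082):
        print(line)
```

Substitute the service name, process ID, and port of the affected service before running any of the printed commands.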

ContrailApiDownMinor¶

Severity	Minor
Summary	More than 30% of `{{ $labels.name }}` API endpoints are not accessible for 2 minutes.
Raise condition	`count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.3`
Description	Raises when at least 30% of the OpenContrail API HTTP checks for a service fail. The `name` label in the raised alert contains the affected service name.
Troubleshooting	Inspect the `ContrailApiDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailApiDownMajor¶

Severity	Major
Summary	More than 60% of `{{ $labels.name }}` API endpoints are not accessible for 2 minutes.
Raise condition	`count(http_response_status{name=~"contrail.*"} == 0) by (name) >= count(http_response_status{name=~"contrail.*"}) by (name) * 0.6`
Description	Raises when at least 60% of the OpenContrail API HTTP checks for a service fail. The `name` label in the raised alert contains the affected service name.
Troubleshooting	Inspect the `ContrailApiDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailApiOutage¶

Severity	Critical
Summary	The `{{ $labels.name }}` API is not accessible for all available endpoints for 2 minutes.
Raise condition	`count(http_response_status{name=~"contrail.*"} == 0) by (name) == count(http_response_status{name=~"contrail.*"}) by (name)`
Description	Raises when the HTTP checks fail for all OpenContrail API endpoints. The `name` label in the raised alert contains the affected service name.
Troubleshooting	Inspect the `ContrailApiDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required
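
The three aggregate API alerts above differ only in the fraction of failing endpoints they require. A minimal Python sketch of that arithmetic follows, assuming a status value of `0` marks a failing check, as in the raise conditions. The function name `aggregate_api_alert` and the sample values are hypothetical. The sketch reports only the most severe matching alert, whereas the monitoring system evaluates each rule independently.

```python
def aggregate_api_alert(statuses):
    """Return the most severe aggregate alert for one service name, or None.

    statuses: http_response_status values, one per endpoint, where 0 means
    the HTTP check failed.
    """
    total = len(statuses)
    failing = sum(1 for value in statuses if value == 0)
    # count(... == 0) yields no result when nothing fails, so no alert is raised.
    if failing == 0:
        return None
    if failing == total:
        return "ContrailApiOutage"
    if failing >= total * 0.6:
        return "ContrailApiDownMajor"
    if failing >= total * 0.3:
        return "ContrailApiDownMinor"
    return None
```

With three endpoints, one failure already reaches the 30% threshold and two failures reach the 60% threshold. With four endpoints, a single failure (25%) raises only the per-node `ContrailApiDown` alert.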

ContrailProcessDown¶

Severity	Minor
Summary	The `{{ $labels.process_name }}` process on the `{{ $labels.host }}` node is down.
Raise condition	`procstat_running{process_name=~"contrail.*"} == 0`
Description	Raises when an OpenContrail process is down on a node. The `host` and `process_name` labels in the raised alert contain the host name of the affected node and the process name.
Troubleshooting	Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailProcessDownMinor¶

Severity	Minor
Summary	More than 30% of `{{ $labels.process_name }}` processes are down.
Raise condition	`count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.3 * count(procstat_running{process_name=~"contrail.*"}) by (process_name)`
Description	Raises when at least 30% of the OpenContrail processes (by name) are in the `DOWN` state. The `process_name` label in the raised alert contains the affected process name.
Troubleshooting	Inspect the `ContrailProcessDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailProcessDownMajor¶

Severity	Major
Summary	More than 60% of `{{ $labels.process_name }}` processes are down.
Raise condition	`count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >= 0.6 * count(procstat_running{process_name=~"contrail.*"}) by (process_name)`
Description	Raises when at least 60% of the OpenContrail processes (by name) are in the `DOWN` state. The `process_name` label in the raised alert contains the affected process name.
Troubleshooting	Inspect the `ContrailProcessDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailProcessOutage¶

Severity	Critical
Summary	All `{{ $labels.process_name }}` processes are down.
Raise condition	`count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) == count(procstat_running{process_name=~"contrail.*"}) by (process_name)`
Description	Raises when an OpenContrail process is in the `DOWN` state on all nodes, indicating that the process is unavailable. The `process_name` label in the raised alert contains the affected process name.
Troubleshooting	Inspect the `ContrailProcessDown` alerts for the host names of the affected nodes. Verify the service status using `systemctl status <service_name>` and the service logs in `/var/log/contrail/`. If the process is still running, obtain the details about the service state: Trace the system calls using `strace -p <pid> -e trace=network`. List the open files, including the network sockets and devices using `lsof -p <pid>`. Analyze the packets sent to the port used by the service using `tcpdump -nei any port <portnum> -A -s 1500`.
Tuning	Not required

ContrailMetadataCheck¶

Available starting from the 2019.2.3 maintenance update

Severity	Critical
Summary	The OpenContrail metadata on the `{{ $labels.host }}` node is unavailable for 15 minutes.
Raise condition	`min(exec_contrail_instance_metadata_present) by (host) == 0`
Description	Raises when the `curl 127.0.0.1:8085/Snh_LinkLocalServiceInfo | grep -c 169.254.169.254` command returns `0`, indicating that the OpenContrail metadata service is unavailable on the affected node.
Troubleshooting	Log in to the OpenContrail web UI using the credentials from `/etc/contrail/contrail-webui-userauth.js` on the network nodes. Navigate to Configure > Infrastructure > Link local services and verify that metadata is configured. Inspect the Telegraf logs using `journalctl -u telegraf`.
Tuning	Not required
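
The check in the description counts the response lines that mention the metadata address. A rough Python equivalent of the `grep -c` step is sketched below. The function name `count_matching_lines` and the sample response text are made up for illustration and do not reproduce the real introspection output. Unlike `grep`, which treats each `.` as a wildcard, the sketch matches the address literally.

```python
METADATA_ADDRESS = "169.254.169.254"


def count_matching_lines(text, needle=METADATA_ADDRESS):
    """Count the lines that contain needle, as `grep -c` does."""
    return sum(1 for line in text.splitlines() if needle in line)


if __name__ == "__main__":
    # Hypothetical response body; the real output format may differ.
    sample = "<ip>169.254.169.254</ip>\n<name>metadata</name>\n"
    print(count_matching_lines(sample))
```

A count of `0` corresponds to the state in which the alert is raised. Like `grep -c`, the function counts matching lines, so an address that appears twice on one line is counted once.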

ContrailHealthCheckDisabled¶

Available starting from the 2019.2.4 maintenance update

Severity	Critical
Summary	The OpenContrail health check is disabled.
Raise condition	`absent(contrail_health_exit_code) == 1`
Description	Raises when the metric from the `contrail-status` check script is absent.
Troubleshooting	Inspect the Telegraf logs on the `ntw` nodes using `journalctl -u telegraf`.
Tuning	Not required

ContrailHealthCheckFailed¶

Available starting from the 2019.2.4 maintenance update

Severity	Critical
Summary	The OpenContrail health check failed for the `{{ $labels.contrail_service }}` service on the `{{ $labels.host }}` node.
Raise condition	`contrail_health_exit_code != 0`
Description	Raises when any `contrail` service from the output of `contrail-status` is inactive. The `contrail_service` label in the raised alert contains the affected service name.
Troubleshooting	Inspect the affected service.
Tuning	Not required
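
Taken together, the two health check alerts distinguish a missing metric from a failing one. The sketch below models that logic in Python, assuming the `contrail_health_exit_code` metric carries one exit code per host and service pair. The function name `health_check_alerts` and the host and service names used with it are hypothetical.

```python
def health_check_alerts(exit_codes):
    """Map contrail_health_exit_code samples to the alerts they would raise.

    exit_codes: dict of (host, contrail_service) -> exit code.
    """
    # absent() matches only when the metric has no series at all.
    if not exit_codes:
        return ["ContrailHealthCheckDisabled"]
    return [
        f"ContrailHealthCheckFailed: {service} on {host}"
        for (host, service), code in sorted(exit_codes.items())
        if code != 0
    ]
```

An empty mapping stands for a check script that reports nothing, which is why it yields `ContrailHealthCheckDisabled` rather than an empty list.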

OpencontrailInstancePingCheckDown¶

Available starting from the 2019.2.4 maintenance update

Severity	Major
Summary	The OpenContrail instance ping check on the `{{ $labels.host }}` node is down for 2 minutes.
Raise condition	`instance_ping_check_up == 0`
Description	Raises when the OpenContrail instance ping check on a node is down for 2 minutes. The `host` label in the raised alert contains the affected node name.
Tuning	Not required
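
The description states that the check must be down for 2 minutes before the alert is raised. The Python sketch below illustrates one way to read that condition, assuming the 2 minutes mean an uninterrupted run of `0` samples. The function name `down_long_enough`, the sample format, and the 60-second sampling interval in the usage note are assumptions, not details taken from the alert definition.

```python
def down_long_enough(samples, hold_seconds=120):
    """Return True if the check stayed at 0 for at least hold_seconds.

    samples: list of (timestamp_seconds, value) pairs in time order.
    """
    down_since = None
    for timestamp, value in samples:
        if value == 0:
            if down_since is None:
                down_since = timestamp  # start of the current outage
            if timestamp - down_since >= hold_seconds:
                return True
        else:
            down_since = None  # any successful check resets the timer
    return False
```

For samples taken every 60 seconds, three consecutive `0` values span 120 seconds and satisfy the condition, while a single successful check in between restarts the count.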

updated: 2025-01-10 08:56
