OpenContrail
This section describes the general alerts for OpenContrail, including the API,
process, instance ping check, and health check alerts.
ContrailApiDown
Severity |
Minor |
Summary |
The {{ $labels.name }} API endpoint on the {{ $labels.host }}
node is not accessible for 2 minutes. |
Raise condition |
http_response_status{name=~"contrail.*"} == 0 |
Description |
Raises when the HTTP check for the OpenContrail API endpoint is failing.
The host and name labels in the raised alert contain the host
name of the affected node and the service name. |
Troubleshooting |
- Before debugging OpenContrail, inspect the Neutron API, Keystone, and
Neutron server alerts, if any.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailApiDownMinor
Severity |
Minor |
Summary |
More than 30% of {{ $labels.name }} API endpoints are not accessible
for 2 minutes. |
Raise condition |
count(http_response_status{name=~"contrail.*"} == 0) by (name) >=
count(http_response_status{name=~"contrail.*"}) by (name) * 0.3 |
Description |
Raises when at least 30% of the OpenContrail API HTTP checks fail. The name
label in the raised alert contains the affected service name. |
Troubleshooting |
- Inspect the ContrailApiDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailApiDownMajor
Severity |
Major |
Summary |
More than 60% of {{ $labels.name }} API endpoints are not accessible
for 2 minutes. |
Raise condition |
count(http_response_status{name=~"contrail.*"} == 0) by (name) >=
count(http_response_status{name=~"contrail.*"}) by (name) * 0.6 |
Description |
Raises when at least 60% of the OpenContrail API HTTP checks fail. The name
label in the raised alert contains the affected service name. |
Troubleshooting |
- Inspect the ContrailApiDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailApiOutage
Severity |
Critical |
Summary |
The {{ $labels.name }} API is not accessible for all available
endpoints for 2 minutes. |
Raise condition |
count(http_response_status{name=~"contrail.*"} == 0) by (name) ==
count(http_response_status{name=~"contrail.*"}) by (name) |
Description |
Raises when the HTTP checks fail for all OpenContrail API endpoints. The
name label in the raised alert contains the affected service name. |
Troubleshooting |
- Inspect the ContrailApiDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
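The three aggregate API alerts above follow one pattern and differ only in the threshold applied to the ratio of failing checks. As a rough illustration, not the StackLight implementation, the following Python sketch applies the same comparisons to a pair of counts for a single service name. The function name firing_api_alerts and the example counts are hypothetical.

```python
from typing import List


def firing_api_alerts(down: int, total: int) -> List[str]:
    """Return the aggregate alerts whose raise condition holds.

    down  -- number of endpoints whose HTTP check reports 0 (hypothetical input)
    total -- number of endpoints checked for the same service name
    """
    # In PromQL, count(... == 0) returns no series when nothing is failing,
    # so none of the comparisons can match in that case.
    if total == 0 or down == 0:
        return []

    alerts = []
    if down >= total * 0.3:
        alerts.append("ContrailApiDownMinor")
    if down >= total * 0.6:
        alerts.append("ContrailApiDownMajor")
    if down == total:
        alerts.append("ContrailApiOutage")
    return alerts


if __name__ == "__main__":
    # One failing endpoint out of three already meets the 30% threshold.
    print(firing_api_alerts(1, 3))
    print(firing_api_alerts(3, 3))
```

Because the conditions use >= rather than >, a service with three endpoints meets the 30% threshold as soon as a single endpoint fails, and an outage satisfies all three conditions at once.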
ContrailProcessDown
Severity |
Minor |
Summary |
The {{ $labels.process_name }} process on the {{ $labels.host }}
node is down. |
Raise condition |
procstat_running{process_name=~"contrail.*"} == 0 |
Description |
Raises when the OpenContrail process is down on one node. The host
and process_name labels in the raised alert contain the host
name of the affected node and the process name. |
Troubleshooting |
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailProcessDownMinor
Severity |
Minor |
Summary |
More than 30% of {{ $labels.process_name }} processes are down. |
Raise condition |
count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >=
0.3 * count(procstat_running{process_name=~"contrail.*"}) by (process_name) |
Description |
Raises when at least 30% of the OpenContrail processes (by name) are in the
DOWN state. The process_name label in the raised alert contains the
affected process name. |
Troubleshooting |
- Inspect the ContrailProcessDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailProcessDownMajor
Severity |
Major |
Summary |
More than 60% of {{ $labels.process_name }} processes are down. |
Raise condition |
count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) >=
0.6 * count(procstat_running{process_name=~"contrail.*"}) by (process_name) |
Description |
Raises when at least 60% of the OpenContrail processes (by name) are in the
DOWN state. The process_name label in the raised alert contains the
affected process name. |
Troubleshooting |
- Inspect the ContrailProcessDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
ContrailProcessOutage
Severity |
Critical |
Summary |
All {{ $labels.process_name }} processes are down. |
Raise condition |
count(procstat_running{process_name=~"contrail.*"} == 0) by (process_name) ==
count(procstat_running{process_name=~"contrail.*"}) by (process_name) |
Description |
Raises when an OpenContrail process is in the DOWN state on all
nodes, indicating that the process is unavailable. The process_name
label in the raised alert contains the affected process name. |
Troubleshooting |
- Inspect the ContrailProcessDown alerts for the host names of the
affected nodes.
- Verify the service status using systemctl status <service_name>
and inspect the service logs in /var/log/contrail/
- If the process is still running, obtain the details about the service
state:
  - Trace the network-related system calls:
  strace -p <pid> -e trace=network
  - List the open files, including the network sockets and devices:
  lsof -p <pid>
  - Analyze the packets sent to the port used by the service:
  tcpdump -nei any port <portnum> -A -s 1500
|
Tuning |
Not required |
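The process alerts above aggregate per-node samples by the process_name label, and their troubleshooting steps point back to the per-node ContrailProcessDown alerts for the affected hosts. A minimal Python sketch of that grouping is shown below, assuming the samples are available as (process_name, host, running) tuples. The function name, the tuple layout, and the host and process values are illustrative assumptions, not data from a real deployment.

```python
from collections import defaultdict


def summarize_down_processes(samples):
    """Group samples by process name and report stopped instances.

    samples -- iterable of (process_name, host, running) tuples, where
               running is 1 for a running process and 0 for a stopped one.
    Only process names with at least one stopped instance are returned,
    mirroring the way count(... == 0) by (process_name) drops empty groups.
    """
    total = defaultdict(int)
    down_hosts = defaultdict(list)
    for process_name, host, running in samples:
        total[process_name] += 1
        if running == 0:
            down_hosts[process_name].append(host)
    return {
        name: {"down": len(hosts), "total": total[name], "hosts": sorted(hosts)}
        for name, hosts in down_hosts.items()
    }


if __name__ == "__main__":
    # Hypothetical samples for two processes on three nodes.
    example = [
        ("contrail-api", "ntw01", 1),
        ("contrail-api", "ntw02", 0),
        ("contrail-api", "ntw03", 1),
        ("contrail-control", "ntw01", 1),
        ("contrail-control", "ntw02", 1),
    ]
    for name, info in summarize_down_processes(example).items():
        ratio = info["down"] / info["total"]
        print(name, info["hosts"], f"{ratio:.0%} down")
```

The ratio of down to total can then be compared against 0.3, 0.6, and 1.0 to see which of the Minor, Major, and Outage conditions would hold for each process name.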
ContrailHealthCheckDisabled
Available starting from the 2019.2.4 maintenance update
Severity |
Critical |
Summary |
The OpenContrail health check is disabled. |
Raise condition |
absent(contrail_health_exit_code) == 1 |
Description |
Raises when the metric from the contrail-status check script is
absent. |
Troubleshooting |
Inspect the Telegraf logs on the ntw nodes. |
Tuning |
Not required |
ContrailHealthCheckFailed
Available starting from the 2019.2.4 maintenance update
Severity |
Critical |
Summary |
The OpenContrail health check failed for the
{{ $labels.contrail_service }} service on the
{{ $labels.host }} node. |
Raise condition |
contrail_health_exit_code != 0 |
Description |
Raises when any OpenContrail service in the output of
contrail-status is inactive. The contrail_service label in
the raised alert contains the affected service name. |
Troubleshooting |
Inspect the affected service. |
Tuning |
Not required |
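The two health check alerts above cover complementary cases: ContrailHealthCheckDisabled raises when the contrail_health_exit_code metric is missing altogether, and ContrailHealthCheckFailed raises for each series that reports a non-zero exit code. The Python sketch below illustrates that split under the assumption that the samples are available as dictionaries with host, contrail_service, and exit_code keys. The function name and the sample values are hypothetical.

```python
def health_check_alerts(samples):
    """Return the health check alerts implied by the given samples.

    samples -- list of dicts with "host", "contrail_service", and
               "exit_code" keys (an assumed layout for illustration).
    """
    # absent(metric) == 1 holds only when no series exist for the metric.
    if not samples:
        return [("ContrailHealthCheckDisabled", None, None)]

    # contrail_health_exit_code != 0 keeps every series with a non-zero code.
    return [
        ("ContrailHealthCheckFailed", s["contrail_service"], s["host"])
        for s in samples
        if s["exit_code"] != 0
    ]


if __name__ == "__main__":
    # Hypothetical samples: one healthy service and one failing service.
    example = [
        {"host": "ntw01", "contrail_service": "contrail-api", "exit_code": 0},
        {"host": "ntw01", "contrail_service": "contrail-schema", "exit_code": 1},
    ]
    for alert in health_check_alerts(example):
        print(alert)
```

The two alerts cannot raise together in this model: if the metric is absent, there are no exit codes to compare.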
OpencontrailInstancePingCheckDown
Available starting from the 2019.2.4 maintenance update
Severity |
Major |
Summary |
The OpenContrail instance ping check on the {{ $labels.host }} node
is down for 2 minutes. |
Raise condition |
instance_ping_check_up == 0 |
Description |
Raises when the OpenContrail instance ping check on a node is down for
2 minutes. The host label in the raised alert contains the affected
node name. |
Tuning |
Not required |
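The summary and description above state that the ping check must be down for 2 minutes, while the raise condition lists only the instantaneous comparison. One plausible reading, stated here as an assumption, is that the rule carries a 2-minute pending period, so the condition must hold continuously before the alert raises. The Python sketch below illustrates that idea on a list of timestamped samples. The function name and the sample data are hypothetical, and the logic is a simplification of how Prometheus evaluates pending alerts.

```python
def held_for(samples, duration_s=120):
    """Check whether the value has been 0 continuously for duration_s.

    samples    -- list of (timestamp_seconds, value) pairs in ascending
                  time order (an assumed layout for illustration).
    duration_s -- required length of the unbroken run of zeros that ends
                  at the most recent sample; 120 matches the 2 minutes
                  mentioned in the alert summary.
    """
    if not samples or samples[-1][1] != 0:
        return False

    latest_ts = samples[-1][0]
    run_start = latest_ts
    # Walk backwards until the first sample that is not 0.
    for ts, value in reversed(samples):
        if value != 0:
            break
        run_start = ts
    return latest_ts - run_start >= duration_s


if __name__ == "__main__":
    # Hypothetical 30-second scrape interval: up once, then down.
    history = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0), (150, 0)]
    print(held_for(history))
```

A brief recovery resets the run, so a check that flaps between 1 and 0 would not satisfy the condition in this model even if it is down most of the time.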