Octavia
This section describes the alerts for Octavia.
OctaviaApiDown
Removed since the 2019.2.11 maintenance update
Severity: Critical
Summary: Octavia API is not accessible for all available Octavia endpoints in the OpenStack service catalog for 2 minutes.
Raise condition: max(openstack_api_check_status{service="octavia-api"}) == 0
Description: Raises when the checks against all available internal Octavia endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response code for Octavia is 200. For a list of all available endpoints, run openstack endpoint list.
Troubleshooting: Verify the availability of internal Octavia endpoints (URLs) from the output of the openstack endpoint list command, as shown in the example below.
Tuning: Not required
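For example, the following commands sketch one way to verify the endpoints; the URL placeholder must be replaced with the internal Octavia endpoint from your service catalog. A healthy Octavia API returns the expected 200 response code.

# List the Octavia endpoints registered in the service catalog
openstack endpoint list --service octavia
# Check the HTTP response code returned by the internal endpoint
# (replace <octavia_internal_endpoint_url> with a URL from the output above)
curl -o /dev/null -s -w "%{http_code}\n" <octavia_internal_endpoint_url>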
OctaviaErrorLogsTooHigh
Severity: Warning
Summary: The average per-second rate of errors in Octavia logs on the {{ $labels.host }} node is more than 0.2 error messages per second (as measured over the last 5 minutes).
Raise condition: sum(rate(log_messages{service="octavia",level=~"error|emergency|fatal"}[5m])) without (level) > 0.2
Description: Raises when the average per-second rate of the error, fatal, or emergency messages in the Octavia logs on the node is more than 0.2 per second. Fluentd forwards all logs from Octavia to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting: Inspect the log files in the /var/log/octavia/ directory on the affected node.
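For example, the following command is one way to review recent error-level messages; exact log file names vary by deployment:

# Show the most recent error, fatal, and emergency messages in the Octavia logs
grep -riE 'error|emergency|fatal' /var/log/octavia/ | tail -n 50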
Tuning: Typically, you should not change the default value. If the alert is constantly firing, inspect the Octavia error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.
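For example, the following query reuses the raise condition with an arbitrary 1-hour range chosen for illustration, which shows the longer-term error rate per node:

sum(rate(log_messages{service="octavia",level=~"error|emergency|fatal"}[1h])) without (level)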
For example, to change the threshold to 0.4:

1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           OctaviaErrorLogsTooHigh:
             if: >-
               sum(rate(log_messages{service="octavia", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4
3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
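As a minimal sketch, assuming your Prometheus version exposes the /api/v1/rules HTTP API and listens on its default port 9090, you can also confirm from the command line that the updated rule is loaded; <prometheus_host> is a placeholder for your Prometheus server address:

# Prints the alert name if the rule is loaded by the Prometheus server
curl -s http://<prometheus_host>:9090/api/v1/rules | grep -o 'OctaviaErrorLogsTooHigh'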