Octavia

This section describes the alerts for Octavia.

OctaviaApiDown
OctaviaErrorLogsTooHigh

OctaviaApiDown

Removed since the 2019.2.11 maintenance update

Severity	Critical
Summary	Octavia API is not accessible for all available Octavia endpoints in the OpenStack service catalog for 2 minutes.
Raise condition	`max(openstack_api_check_status{service="octavia-api"}) == 0`
Description	Raises when the checks against all available internal Octavia endpoints in the OpenStack service catalog do not pass. Telegraf sends HTTP requests to the URLs from the OpenStack service catalog and compares the expected and actual HTTP response codes. The expected response code for Octavia is `200`. For a list of all available endpoints, run `openstack endpoint list`.
Troubleshooting	Verify the availability of internal Octavia endpoints (URLs) from the output of the `openstack endpoint list` command.
Tuning	Not required
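
The check logic described above can be sketched in Python. This is a minimal illustration only, not the Telegraf implementation: the `probe` and `api_down` helpers are hypothetical names, and the sketch assumes each endpoint check yields `1` on an expected response and `0` otherwise, which is what the raise condition `max(...) == 0` suggests.

```python
from typing import Sequence
from urllib import error, request

# Expected response code for Octavia, as stated in the description above.
EXPECTED_STATUS = 200


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 1 if the URL answers with the expected status code, else 0.

    Hypothetical helper: it only approximates the kind of HTTP check
    that Telegraf performs against the service catalog URLs.
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return int(resp.status == EXPECTED_STATUS)
    except (error.URLError, OSError, ValueError):
        # Connection failures, HTTP error codes, and malformed URLs
        # all count as a failed check.
        return 0


def api_down(check_statuses: Sequence[int]) -> bool:
    """Mirror `max(openstack_api_check_status) == 0`.

    True only when at least one check exists and every check failed.
    """
    return bool(check_statuses) and max(check_statuses) == 0
```

With this logic, a single passing endpoint keeps the condition false: `api_down([0, 1])` evaluates to `False`, while `api_down([0, 0])` evaluates to `True`.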

OctaviaErrorLogsTooHigh

Severity	Warning
Summary	The average per-second rate of errors in Octavia logs on the `{{ $labels.host }}` node is more than 0.2 error messages per second (as measured over the last 5 minutes).
Raise condition	`sum(rate(log_messages{service="octavia",level=~"error|emergency|fatal"}[5m])) without (level) > 0.2`
Description	Raises when the average per-second rate of the `error`, `fatal`, or `emergency` messages in the Octavia logs on the node is more than 0.2 per second. Fluentd forwards all logs from Octavia to Elasticsearch and counts the number of log messages per severity. The `host` label in the raised alert contains the host name of the affected node.
Troubleshooting	Inspect the log files in the `/var/log/octavia/` directory on the affected node.
Tuning	Typically, you should not change the default value. If the alert is constantly firing, inspect the Octavia error logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in logs for a longer period of time and define the best threshold.

For example, to change the threshold to `0.4`:

1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

   a. Create a file for alert customizations:

          touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:

          classes:
          - cluster.<cluster_name>.stacklight.custom.alerts
          ...

2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:

       parameters:
         prometheus:
           server:
             alert:
               OctaviaErrorLogsTooHigh:
                 if: >-
                   sum(rate(log_messages{service="octavia", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4

3. From the Salt Master node, apply the changes:

       salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
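
To get a feel for what a threshold means in practice, the following Python sketch converts a count of error messages in the 5-minute window into an average per-second rate and compares it with a threshold. It is a simplification under stated assumptions: the helper names are hypothetical, and the PromQL `rate()` function additionally handles counter resets and extrapolation, which this sketch ignores.

```python
# Length of the [5m] range used in the raise condition, in seconds.
WINDOW_SECONDS = 5 * 60


def per_second_rate(messages_in_window: int, window: int = WINDOW_SECONDS) -> float:
    """Average number of messages per second over the window (simplified)."""
    return messages_in_window / window


def alert_fires(messages_in_window: int, threshold: float = 0.2) -> bool:
    """Return True if the simplified rate exceeds the threshold."""
    return per_second_rate(messages_in_window) > threshold


# Hypothetical example: 90 error messages logged in 5 minutes.
errors = 90
default_result = alert_fires(errors)        # compared with the default 0.2
tuned_result = alert_fires(errors, 0.4)     # compared with the example 0.4
```

In this hypothetical case, 90 messages in 300 seconds average 0.3 messages per second, so the alert fires with the default threshold of `0.2` but not with the example threshold of `0.4`. Put differently, the default threshold corresponds to roughly 60 error messages in 5 minutes.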

updated: 2025-01-10 08:56
