Ironic
Note
  This feature is available starting from the MCP 2019.2.6 maintenance
  update. Before using the feature, follow the steps described in Apply
  maintenance updates.
This section describes the alerts for Ironic.
IronicErrorLogsTooHigh
Severity
  Warning
Summary
  The average per-second rate of errors in Ironic logs on the
  {{ $labels.host }} node is {{ $value }} (as measured over the last
  5 minutes).
Raise condition
  sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
Description
  Raises when the average per-second rate of error, fatal, or emergency
  messages in Ironic logs on the node exceeds 0.2 per second, which is
  approximately one message every 5 seconds. Fluentd forwards all logs
  from Ironic to Elasticsearch and counts the number of log messages per
  severity. The host label in the raised alert contains the host name of
  the affected node.
  Warning
    For production environments, configure the alert after deployment.
Troubleshooting
  Inspect the log files in the /var/log/ironic/ directory on the affected
  node. To see which node and severity level contribute most, you can also
  run the alert query ad hoc, as shown in the example below.
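A minimal sketch of such an ad hoc query against the Prometheus HTTP API,
assuming Prometheus listens on <prometheus_host>:9090 (a placeholder;
adjust to your deployment):

  # Break down the Ironic log message rate by node and severity level.
  curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(log_messages{service="ironic"}[5m])) by (host, level)'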
Tuning
  For example, to change the threshold to 0.1 (one error every 10 seconds
  for the entire cluster):

  1. On the cluster level of the Reclass model, create a common file for
     all alert customizations. Skip this step if an alert customizations
     file is already defined.

     Create a file for alert customizations:

       touch cluster/<cluster_name>/stacklight/custom/alerts.yml

     Define the new file in cluster/<cluster_name>/stacklight/server.yml:

       classes:
       - cluster.<cluster_name>.stacklight.custom.alerts
       ...

  2. In the defined alert customizations file, modify the alert threshold
     by overriding the if parameter:

       parameters:
         prometheus:
           server:
             alert:
               IronicErrorLogsTooHigh:
                 if: >-
                   sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.1

  3. From the Salt Master node, apply the changes:

       salt 'I@prometheus:server' state.sls prometheus.server

  4. Verify the updated alert definition in the Prometheus web UI.
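You can also verify the change through the Prometheus HTTP API instead of
the web UI; a minimal sketch, assuming Prometheus listens on
<prometheus_host>:9090 and python is available for pretty-printing:

  # List the loaded alerting rules and show the Ironic alert definition.
  curl -s 'http://<prometheus_host>:9090/api/v1/rules' \
    | python -m json.tool | grep -B 2 -A 6 'IronicErrorLogsTooHigh'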
IronicProcessDown
Severity
  Minor
Summary
  The {{ $labels.process_name }} process on the {{ $labels.host }} node
  is down.
Raise condition
  procstat_running{process_name=~"ironic-.*"} == 0
Description
  Raises when an Ironic process (API or conductor) on a host is down. The
  process_name and host labels contain the name of the affected process
  and the affected node.
Troubleshooting
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name> (see the example after this alert).
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning
  Not required
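For example, to check the conductor process on the affected node; the
ironic-conductor unit name is an assumption and may differ depending on
the packaging:

  # Show the service state, then review its recent journal entries.
  systemctl status ironic-conductor
  journalctl -u ironic-conductor --since '15 minutes ago'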
IronicProcessDownMinor
Severity
  Minor
Summary
  The {{ $labels.process_name }} process is down on 33% of nodes.
Raise condition
  count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name)
  >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.33
Description
  Raises when at least 33% of Ironic processes of one type (API or
  conductor) are down. The process_name label contains the name of the
  affected processes.
Troubleshooting
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
  - To list the down instances per node, run the underlying query ad hoc,
    as shown in the example below.
Tuning
  Not required
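A minimal sketch of the ad hoc query against the Prometheus HTTP API,
assuming Prometheus listens on <prometheus_host>:9090:

  # Return every ironic-* process currently reported as down, together
  # with its process_name and host labels.
  curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=procstat_running{process_name=~"ironic-.*"} == 0'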
IronicProcessDownMajor
Severity
  Major
Summary
  The {{ $labels.process_name }} process is down on 66% of nodes.
Raise condition
  count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name)
  >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.66
Description
  Raises when at least 66% of Ironic processes of one type (API or
  conductor) are down. The process_name label contains the name of the
  affected processes.
Troubleshooting
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning
  Not required
IronicProcessOutage
Severity
  Critical
Summary
  The {{ $labels.process_name }} process is down on all nodes.
Raise condition
  count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name)
  == count(procstat_running{process_name=~"ironic-.*"}) by (process_name)
Description
  Raises when all Ironic processes of one type (API or conductor) are
  down. The process_name label contains the name of the affected
  processes.
Troubleshooting
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning
  Not required
IronicDriversMissing
Severity
  Major
Summary
  The ironic-conductor {{ $labels.driver }} back-end driver is missing on
  {{ $value }} node(s).
Raise condition
  scalar(count(procstat_running{process_name=~"ironic-conductor"} == 1))
  - count(openstack_ironic_driver) by (driver) > 0
Description
  Raises when Ironic conductors have a different number of back-end
  drivers enabled. The cluster performance is not affected. However, the
  cluster may lose HA.
Troubleshooting
  Inspect the Drivers panel of the Ironic Grafana dashboard for the nodes
  that have the disabled driver. You can also compare the drivers enabled
  on each conductor from the CLI, as shown in the example below.
Tuning
  Not required
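For example, assuming python-ironicclient is installed and OpenStack admin
credentials are sourced:

  # List each enabled driver together with the conductor host(s) serving
  # it; a driver missing on a conductor shows fewer active hosts.
  openstack baremetal driver list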
IronicApiEndpointDown
Severity
  Minor
Summary
  The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not
  accessible for 2 minutes.
Raise condition
  http_response_status{name=~"ironic-api.*"} == 0
Description
  Raises when an Ironic API endpoint (deploy or public API) has not been
  responding to HTTP health checks for 2 minutes. The name and host labels
  contain the name of the affected endpoint and the affected node.
Troubleshooting |
- Inspect the
IronicProcessDown alert for the ironic-api
process.
- Log in to the corresponding node and verify the process status using
systemctl status <process_name> .
- Inspect the log files in the
/var/log/ironic/<process_name>
directory.
|
Tuning |
Not required |
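A minimal sketch of a manual probe; the URL is an assumption (6385 is the
default Ironic API port, and the address actually checked is recorded in
the name label of the http_response_status metric):

  # Request the Ironic API root; a healthy endpoint returns HTTP 200 and
  # the list of supported API versions.
  curl -i 'http://<ironic_api_host>:6385/'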
IronicApiEndpointsDownMajor
Severity
  Major
Summary
  {{ $value }} of {{ $labels.name }} endpoints (>= 50%) are not accessible
  for 2 minutes.
Raise condition
  count(http_response_status{name=~"ironic-api.*"} == 0) by (name) >=
  count(http_response_status{name=~"ironic-api.*"}) by (name) * 0.5
Description
  Raises when at least 50% of Ironic API endpoints (deploy or public API)
  have not been responding to HTTP health checks for 2 minutes. The name
  label contains the name of the affected endpoint.
Troubleshooting
  - Inspect the IronicProcessDown alert for the ironic-api process.
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
  - To see the share of failing checks per endpoint, run the ratio of the
    two counts from the raise condition, as shown in the example below.
Tuning
  Not required
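A minimal sketch of the ratio query against the Prometheus HTTP API,
assuming Prometheus listens on <prometheus_host>:9090:

  # Compute the share of failing HTTP checks per endpoint name (1 = all down).
  curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
    --data-urlencode 'query=count(http_response_status{name=~"ironic-api.*"} == 0) by (name) / count(http_response_status{name=~"ironic-api.*"}) by (name)'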
IronicApiEndpointsOutage
Severity
  Critical
Summary
  All available {{ $labels.name }} endpoints are not accessible for 2
  minutes.
Raise condition
  count(http_response_status{name=~"ironic-api.*"} == 0) by (name) ==
  count(http_response_status{name=~"ironic-api.*"}) by (name)
Description
  Raises when all Ironic API endpoints (deploy or public API) have not
  been responding to HTTP health checks for 2 minutes. The name label
  contains the name of the affected endpoint.
Troubleshooting
  - Inspect the IronicProcessDown alert for the ironic-api process.
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning
  Not required
IronicApiOutage
Removed since the 2019.2.11 maintenance update
Severity
  Critical
Summary
  Ironic API is not accessible for all available Ironic endpoints in the
  OpenStack service catalog for 2 minutes.
Raise condition
  max(openstack_api_check_status{service="ironic"}) == 0
Description
  Raises when the Ironic API or conductor service is in the DOWN state on
  all ctl or bmt hosts. For the exact nodes and services, inspect the
  host and process_name labels of the IronicProcessDown alerts.
Troubleshooting
  - Inspect the IronicProcessDown alert for the ironic-api process.
  - Log in to the corresponding node and verify the process status using
    systemctl status <process_name>.
  - Inspect the log files in the /var/log/ironic/<process_name> directory.
  - Verify the Telegraf monitoring_remote_agent service (see the example
    below):
    - Verify the status of the monitoring_remote_agent service using
      docker service ls.
    - Inspect the monitoring_remote_agent service logs by running
      docker service logs monitoring_remote_agent on one of the mon nodes.
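For example, on one of the mon nodes (the --tail value is an arbitrary
choice):

  # Confirm the remote agent service is running, then view its recent logs.
  docker service ls --filter name=monitoring_remote_agent
  docker service logs --tail 100 monitoring_remote_agent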
Tuning
  Not required