Ironic

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the alerts for Ironic.


IronicErrorLogsTooHigh

Severity Warning
Summary The average per-second rate of errors in Ironic logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
Description

Raises when the average per-second rate of error, fatal, or emergency messages in Ironic logs on the node is more than 0.2 per second, which is approximately 1 message per 5 seconds for all nodes in the cluster. Fluentd forwards all logs from Ironic to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the log files in the /var/log/ironic/ directory on the affected node.
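To identify which severity level and node contribute most to the rate, you can also run a query similar to the following in the Prometheus web UI. This is a sketch based on the log_messages metric from the raise condition; verify the label names against your deployment:

  sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) by (host, level)
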
Tuning

For example, to change the threshold to 0.1 (one error every 10 seconds for the entire cluster):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            IronicErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m]))
                without (level) > 0.1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
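
You can also verify that Prometheus loaded the updated rule through its HTTP API. The following is a sketch: <prometheus_host> is a placeholder and 9090 is the default Prometheus port, which may differ in your deployment.

  curl -s http://<prometheus_host>:9090/api/v1/rules | python -m json.tool | grep -B 2 -A 8 IronicErrorLogsTooHigh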


IronicProcessDown

Severity Minor
Summary The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name=~"ironic-.*"} == 0
Description Raises when an Ironic process (API or conductor) on a host is down. The process_name and host labels contain the name of the affected process and the affected node.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
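For example, a minimal check sequence on the affected node might look as follows. This is a sketch that assumes the process runs as a systemd unit of the same name (ironic-conductor is used as an example process name):

  systemctl status ironic-conductor
  journalctl -u ironic-conductor --since "1 hour ago" | tail -n 50
  ls -l /var/log/ironic/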
Tuning Not required

IronicProcessDownMinor

Severity Minor
Summary The {{ $labels.process_name }} process is down on 33% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.33
Description Raises when at least 33% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessDownMajor

Severity Major
Summary The {{ $labels.process_name }} process is down on 66% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.66
Description Raises when at least 66% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessOutage

Severity Critical
Summary The {{ $labels.process_name }} process is down on all nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) == count(procstat_running{process_name=~"ironic-.*"}) by (process_name)
Description Raises when all instances of an Ironic process (API or conductor) are down across the cluster. The process_name label contains the name of the affected process.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicDriversMissing

Severity Major
Summary The ironic-conductor {{ $labels.driver }} back-end driver is missing on {{ $value }} node(s).
Raise condition scalar(count(procstat_running{process_name=~"ironic-conductor"} == 1)) - count(openstack_ironic_driver) by (driver) > 0
Description Raises when Ironic conductors have a different number of back-end drivers enabled. The cluster performance is not affected. However, the cluster may lose HA.
Troubleshooting Inspect the Drivers panel of the Ironic Grafana dashboard for the nodes that have the disabled driver.
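You can also compare the enabled drivers and their active conductor hosts from the CLI. This is a sketch using the standard OpenStack client; source the admin credentials first (the /root/keystonercv3 path is an example and may differ in your deployment):

  source /root/keystonercv3
  openstack baremetal driver list

The output lists each driver together with the conductor hosts on which it is active, which helps to spot the conductor that has the driver disabled.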
Tuning Not required

IronicApiEndpointDown

Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"ironic-api.*"} == 0
Description Raises when an Ironic API endpoint (deploy or public API) has not been responding to HTTP health checks for 2 minutes. The name and host labels contain the name of the affected endpoint and the affected node.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
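You can also probe the endpoint manually. This is a sketch: the address and port are placeholders based on the default Ironic API port 6385; use the actual endpoint reported in the name label of the alert:

  curl -i http://<ironic_api_address>:6385/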
Tuning Not required

IronicApiEndpointsDownMajor

Severity Major
Summary {{ $value }} of {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) >= count(http_response_status{name=~"ironic-api.*"}) by (name) * 0.5
Description Raises when at least 50% of Ironic API endpoints (deploy or public API) have not been responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiEndpointsOutage

Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) == count(http_response_status{name=~"ironic-api.*"}) by (name)
Description Raises when all Ironic API endpoints (deploy or public API) have not been responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Ironic API is not accessible for all available Ironic endpoints in the OpenStack service catalog for 2 minutes.
Raise condition max(openstack_api_check_status{service="ironic"}) == 0
Description Raises when the Ironic API or conductor service is in the DOWN state on all ctl or bmt hosts. For the exact nodes and services, inspect the host and process_name labels of the IronicProcessDown alerts.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
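A minimal check sequence on a mon node might look as follows (a sketch; the --filter and --tail options only narrow the output):

  docker service ls --filter name=monitoring_remote_agent
  docker service logs --tail 100 monitoring_remote_agent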
Tuning Not required