Ironic

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the alerts for Ironic.


IronicErrorLogsTooHigh

Severity

Warning

Summary

The average per-second rate of errors in Ironic logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).

Raise condition

sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2

Description

Raises when the average per-second rate of error, fatal, or emergency messages in Ironic logs on the node is more than 0.2 per second, which is approximately 1 message per 5 seconds for all nodes in the cluster. Fluentd forwards all logs from Ironic to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
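As a quick illustration of which severities count toward this alert, you can test the case-insensitive severity regex from the raise condition against sample log lines. The lines below are made up for illustration; only the ERROR and FATAL lines match:

```shell
# Hypothetical log lines; only ERROR and FATAL match the alert's
# case-insensitive (error|emergency|fatal) severity regex.
printf '%s\n' \
  'ERROR ironic.conductor.manager Failed to deploy node' \
  'WARNING ironic.api.wsgi Slow request' \
  'FATAL ironic.conductor Unrecoverable state' |
  grep -ciE 'error|emergency|fatal'
# prints 2
```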

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the log files in the /var/log/ironic/ directory on the affected node.

Tuning

For example, to change the threshold to 0.1 (one error per every 10 seconds for the entire cluster):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            IronicErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="ironic",level=~"(?i:\
                (error|emergency|fatal))"}[5m])) without (level) > 0.1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
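As an alternative to the web UI, the loaded rule can be inspected through the Prometheus HTTP API. The host and port below are placeholders; substitute the Prometheus server address of your deployment:

```shell
# Placeholder address; the /api/v1/rules endpoint lists all loaded alerting rules.
curl -s http://<prometheus_host>:<prometheus_port>/api/v1/rules | grep -o 'IronicErrorLogsTooHigh'
```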


IronicProcessDown

Severity

Minor

Summary

The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name=~"ironic-.*"} == 0

Description

Raises when an Ironic process (API or conductor) on a host is down. The process_name and host labels contain the name of the affected process and the affected node.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.
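For example, to check the conductor service on the affected node (the unit name below is an example; use the value from the process_name label of the raised alert):

```shell
# Example commands for an affected node; adjust the unit name as needed.
systemctl status ironic-conductor               # current state and recent log tail
journalctl -u ironic-conductor --since "-1h"    # journal entries for the last hour
```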

Tuning

Not required


IronicProcessDownMinor

Severity

Minor

Summary

The {{ $labels.process_name }} process is down on 33% of nodes.

Raise condition

count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.33

Description

Raises when at least 33% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
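The 33% threshold can be sanity-checked with simple arithmetic. As a hypothetical example, with 3 conductor instances the alert fires as soon as one of them is down, because 1 >= 3 * 0.33:

```shell
# Hypothetical instance counts; prints whether the Minor alert condition holds.
total=3   # ironic-conductor instances reporting the metric
down=1    # instances where procstat_running == 0
awk -v d="$down" -v t="$total" 'BEGIN { print ((d >= t * 0.33) ? "fires" : "does not fire") }'
# prints fires
```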

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicProcessDownMajor

Severity

Major

Summary

The {{ $labels.process_name }} process is down on 66% of nodes.

Raise condition

count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.66

Description

Raises when at least 66% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicProcessOutage

Severity

Critical

Summary

The {{ $labels.process_name }} process is down on all nodes.

Raise condition

count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) == count(procstat_running{process_name=~"ironic-.*"}) by (process_name)

Description

All specified Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.

Troubleshooting

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicDriversMissing

Severity

Major

Summary

The ironic-conductor {{ $labels.driver }} back-end driver is missing on {{ $value }} node(s).

Raise condition

scalar(count(procstat_running{process_name=~"ironic-conductor"} == 1)) - count(openstack_ironic_driver) by (driver) > 0

Description

Raises when Ironic conductors have a different number of back-end drivers enabled. The cluster performance is not affected. However, the cluster may lose HA.
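The alert value is the difference between the number of running conductors and the number of conductors that expose the given driver. As a hypothetical example, with 3 running conductors and a driver registered on only 2 of them, the alert reports 1 missing node:

```shell
# Hypothetical counts: 3 running conductors, the driver present on 2 of them.
conductors=3
with_driver=2
echo $((conductors - with_driver))   # number of nodes missing the driver
# prints 1
```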

Troubleshooting

Inspect the Drivers panel of the Ironic Grafana dashboard for the nodes that have the disabled driver.

Tuning

Not required


IronicApiEndpointDown

Severity

Minor

Summary

The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.

Raise condition

http_response_status{name=~"ironic-api.*"} == 0

Description

Raises when an Ironic API endpoint (deploy or public API) was not responding to HTTP health checks for 2 minutes. The name and host labels contain the name of the affected endpoint and the affected node.
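In addition to the checks below, the endpoint can be probed manually. The address is a placeholder, and 6385 is the default Ironic API port; adjust both if your deployment differs:

```shell
# Placeholder address; prints the HTTP status code returned by the endpoint.
curl -s -o /dev/null -w '%{http_code}\n' http://<endpoint_address>:6385/
```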

Troubleshooting

  • Inspect the IronicProcessDown alert for the ironic-api process.

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicApiEndpointsDownMajor

Severity

Major

Summary

{{ $value }} of {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"ironic-api.*"} == 0) by (name) >= count(http_response_status{name=~"ironic-api.*"}) by (name) * 0.5

Description

Raises when at least 50% of Ironic API endpoints (deploy or public API) were not responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.

Troubleshooting

  • Inspect the IronicProcessDown alert for the ironic-api process.

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicApiEndpointsOutage

Severity

Critical

Summary

All available {{ $labels.name }} endpoints are not accessible for 2 minutes.

Raise condition

count(http_response_status{name=~"ironic-api.*"} == 0) by (name) == count(http_response_status{name=~"ironic-api.*"}) by (name)

Description

Raises when all Ironic API endpoints (deploy or public API) were not responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.

Troubleshooting

  • Inspect the IronicProcessDown alert for the ironic-api process.

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

Tuning

Not required


IronicApiOutage

Removed since the 2019.2.11 maintenance update

Severity

Critical

Summary

Ironic API is not accessible for all available Ironic endpoints in the OpenStack service catalog for 2 minutes.

Raise condition

max(openstack_api_check_status{service="ironic"}) == 0

Description

Raises when the Ironic API or conductor service is in the DOWN state on all ctl or bmt hosts. For the exact nodes and services, inspect the host and process_name labels of the IronicProcessDown alerts.

Troubleshooting

  • Inspect the IronicProcessDown alert for the ironic-api process.

  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.

  • Inspect the log files in the /var/log/ironic/<process_name> directory.

  • Verify the Telegraf monitoring_remote_agent service:

    • Verify the status of the monitoring_remote_agent service using docker service ls.

    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.

Tuning

Not required