Ironic

Note

This feature is available starting from the MCP 2019.2.6 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

This section describes the alerts for Ironic.


IronicErrorLogsTooHigh

Severity Warning
Summary The average per-second rate of errors in Ironic logs on the {{ $labels.host }} node is {{ $value }} (as measured over the last 5 minutes).
Raise condition sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
Description

Raises when the average per-second rate of error, fatal, or emergency messages in Ironic logs on the node is more than 0.2 per second, which is approximately 1 message per 5 seconds for all nodes in the cluster. Fluentd forwards all logs from Ironic to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting Inspect the log files in the /var/log/ironic/ directory on the affected node.
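To identify which severity level and node contribute most to the rate, you can also run a query similar to the following in the Prometheus web UI. This is a sketch based on the log_messages metric from the raise condition; verify the label names against your deployment:

  sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m])) by (host, level)
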
Tuning

For example, to change the threshold to 0.1 (one error every 10 seconds for the entire cluster):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            IronicErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="ironic",level=~"(?i:(error|emergency|fatal))"}[5m]))
                without (level) > 0.1
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
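
You can also verify that Prometheus loaded the updated rule through its HTTP API. The following is a sketch: <prometheus_host> is a placeholder and 9090 is the default Prometheus port, which may differ in your deployment.

  curl -s http://<prometheus_host>:9090/api/v1/rules | python -m json.tool | grep -B 2 -A 8 IronicErrorLogsTooHigh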


IronicProcessDown

Severity Minor
Summary The {{ $labels.process_name }} process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name=~"ironic-.*"} == 0
Description Raises when an Ironic process (API or conductor) on a host is down. The process_name and host labels contain the name of the affected process and the affected node.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
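For example, a minimal check sequence on the affected node might look as follows. This is a sketch that assumes the process runs as a systemd unit of the same name (ironic-conductor is used as an example process name):

  systemctl status ironic-conductor
  journalctl -u ironic-conductor --since "1 hour ago" | tail -n 50
  ls -l /var/log/ironic/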
Tuning Not required

IronicProcessDownMinor

Severity Minor
Summary The {{ $labels.process_name }} process is down on 33% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.33
Description Raises when at least 33% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessDownMajor

Severity Major
Summary The {{ $labels.process_name }} process is down on 66% of nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) >= count(procstat_running{process_name=~"ironic-.*"}) by (process_name) * 0.66
Description Raises when at least 66% of Ironic processes (API or conductor) are down. The process_name label contains the name of the affected processes.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicProcessOutage

Severity Critical
Summary The {{ $labels.process_name }} process is down on all nodes.
Raise condition count(procstat_running{process_name=~"ironic-.*"} == 0) by (process_name) == count(procstat_running{process_name=~"ironic-.*"}) by (process_name)
Description Raises when all instances of an Ironic process (API or conductor) are down across the cluster. The process_name label contains the name of the affected process.
Troubleshooting
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicDriversMissing

Severity Major
Summary The ironic-conductor {{ $labels.driver }} back-end driver is missing on {{ $value }} node(s).
Raise condition scalar(count(procstat_running{process_name=~"ironic-conductor"} == 1)) - count(openstack_ironic_driver) by (driver) > 0
Description Raises when Ironic conductors have a different number of back-end drivers enabled. The cluster performance is not affected. However, the cluster may lose HA.
Troubleshooting Inspect the Drivers panel of the Ironic Grafana dashboard for the nodes that have the disabled driver.
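You can also compare the enabled drivers and their active conductor hosts from the CLI. This is a sketch using the standard OpenStack client; source the admin credentials first (the /root/keystonercv3 path is an example and may differ in your deployment):

  source /root/keystonercv3
  openstack baremetal driver list

The output lists each driver together with the conductor hosts on which it is active, which helps to spot the conductor that has the driver disabled.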
Tuning Not required

IronicApiEndpointDown

Severity Minor
Summary The {{ $labels.name }} endpoint on the {{ $labels.host }} node is not accessible for 2 minutes.
Raise condition http_response_status{name=~"ironic-api.*"} == 0
Description Raises when an Ironic API endpoint (deploy or public API) has not been responding to HTTP health checks for 2 minutes. The name and host labels contain the name of the affected endpoint and the affected node.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
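You can also probe the endpoint manually. This is a sketch: the address and port are placeholders based on the default Ironic API port 6385; use the actual endpoint reported in the name label of the alert:

  curl -i http://<ironic_api_address>:6385/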
Tuning Not required

IronicApiEndpointsDownMajor

Severity Major
Summary {{ $value }} of {{ $labels.name }} endpoints (>= 50%) are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) >= count(http_response_status{name=~"ironic-api.*"}) by (name) * 0.5
Description Raises when at least 50% of Ironic API endpoints (deploy or public API) have not been responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiEndpointsOutage

Severity Critical
Summary All available {{ $labels.name }} endpoints are not accessible for 2 minutes.
Raise condition count(http_response_status{name=~"ironic-api.*"} == 0) by (name) == count(http_response_status{name=~"ironic-api.*"}) by (name)
Description Raises when all Ironic API endpoints (deploy or public API) have not been responding to HTTP health checks for 2 minutes. The name label contains the name of the affected endpoint.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
Tuning Not required

IronicApiOutage

Removed since the 2019.2.11 maintenance update

Severity Critical
Summary Ironic API is not accessible for all available Ironic endpoints in the OpenStack service catalog for 2 minutes.
Raise condition max(openstack_api_check_status{service="ironic"}) == 0
Description Raises when the Ironic API or conductor service is in the DOWN state on all ctl or bmt hosts. For the exact nodes and services, inspect the host and process_name labels of the IronicProcessDown alerts.
Troubleshooting
  • Inspect the IronicProcessDown alert for the ironic-api process.
  • Log in to the corresponding node and verify the process status using systemctl status <process_name>.
  • Inspect the log files in the /var/log/ironic/<process_name> directory.
  • Verify the Telegraf monitoring_remote_agent service:
    • Verify the status of the monitoring_remote_agent service using docker service ls.
    • Inspect the monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on one of the mon nodes.
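A minimal check sequence on a mon node might look as follows (a sketch; the --filter and --tail options only narrow the output):

  docker service ls --filter name=monitoring_remote_agent
  docker service logs --tail 100 monitoring_remote_agent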
Tuning Not required