System

This section describes the system alerts.


SystemCpuFullWarning

Severity

Warning

Summary

The average CPU usage on the {{ $labels.host }} node is more than 90% for 2 minutes.

Raise condition

100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 90

Description

Raises when the average CPU idle time on a node is less than 10% for the last 5 minutes, indicating that the node is under load. The host label in the raised alert contains the name of the affected node.
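The condition works on idle time rather than busy time. A minimal Python sketch of the arithmetic, using illustrative idle-time samples (not real telemetry):

```python
# If the average cpu_usage_idle over the window is below 10 %, the busy
# CPU time (100 - idle) is above 90 % and the alert fires.
idle_samples = [7.0, 9.5, 8.5, 6.0, 9.0]  # illustrative avg_over_time inputs

avg_idle = sum(idle_samples) / len(idle_samples)
busy = 100 - avg_idle

print(busy)       # 92.0
print(busy > 90)  # True -> SystemCpuFullWarning would fire
```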

Troubleshooting

Inspect the output of the top command on the affected node.

Tuning

For example, to change the threshold to 20%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuFullWarning:
              if: >-
                100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 80
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighWarning

Severity

Warning

Summary

The system load per CPU on the {{ $labels.host }} node is more than 1 for 5 minutes.

Raise condition

  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 1

  • In 2019.2.8: system_load15 / system_n_cpus > 1

  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}

Description

Raises when the average load on the node is higher than 1 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.

Tuning

For example, to change the threshold to 1.5 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighWarning:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 1.5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighCritical

Severity

Critical. Prior to 2019.2.8, Warning.

Summary

The system load per CPU on the {{ $labels.host }} node is more than 2 for 5 minutes.

Raise condition

  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 2

  • In 2019.2.8: system_load15 / system_n_cpus > 2

  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}

Description

Raises when the average load on the node is higher than 2 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.

Tuning

For example, to change the threshold to 3 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighCritical:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 3
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskFullWarning

Severity

Warning

Summary

The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is more than 85% full for 2 minutes.

Raise condition

disk_used_percent >= 85

Description

Raises when the disk partition on a node is 85% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free disk space on the node using the df command.

  • Increase the disk space on the affected node or remove unused data.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullWarning:
              if: >-
                disk_used_percent >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskFullMajor

Severity

Major

Summary

The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is 95% full for 2 minutes.

Raise condition

disk_used_percent >= 95

Description

Raises when the disk partition on a node is 95% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free disk space on the node using the df command.

  • Increase the disk space on the affected node or remove unused data.

Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullMajor:
              if: >-
                disk_used_percent >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullWarning

Severity

Warning

Summary

The {{ $labels.host }} node uses more than 85% of disk inodes in the {{ $labels.path }} volume for 2 minutes.

Raise condition

100 * disk_inodes_used / disk_inodes_total >= 85.0

Description

Raises when the usage of inodes of a disk partition on the node is 85%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free inodes on the affected node using the df -i command.

  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullWarning:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullMajor

Severity

Major

Summary

The {{ $labels.host }} node uses more than 95% of disk inodes in the {{ $labels.path }} volume for 2 minutes.

Raise condition

100 * disk_inodes_used / disk_inodes_total >= 95.0

Description

Raises when the usage of inodes of a disk partition on the node is 95%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free inodes on the affected node using the df -i command.

  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.

Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullMajor:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting errors for 5 minutes.

Raise condition

increase(hdd_errors_total[1m]) > 0

Description

Raises when disk error messages were detected in the syslog (/var/log/syslog) on a host every minute for the last 5 minutes. Fluentd parses the syslog for the error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.* and \b[sv]d[a-z]{1,2}\d{0,3}\b.*error regular expressions and increases the counter on each match. The host and device labels in the raised alert contain the name of the affected node and the affected device.
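The two patterns can be exercised directly. A small Python sketch, using illustrative syslog fragments rather than real log output:

```python
import re

# The two regular expressions Fluentd applies to /var/log/syslog,
# copied from the description above.
PATTERNS = [
    re.compile(r"error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.*"),
    re.compile(r"\b[sv]d[a-z]{1,2}\d{0,3}\b.*error"),
]

def counts_as_disk_error(line):
    """True if the line would increment hdd_errors_total."""
    return any(p.search(line) for p in PATTERNS)

# Illustrative syslog fragments, not real log output.
print(counts_as_disk_error("kernel: blk_update_request: I/O error, dev sda1"))  # True
print(counts_as_disk_error("CRON[1234]: session opened for user root"))         # False
```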

Troubleshooting

Inspect the syslog journal for error words using grep.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskErrorsTooHigh:
              if: >-
                increase(hdd_errors_total[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskBacklogWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

Disk {{ $labels.name }} backlog warning.

Raise condition

rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 10

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 10 during the last 10 minutes.
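The arithmetic behind the condition can be sketched as follows, assuming (as Telegraf reports it) that diskio_weighted_io_time is a counter in milliseconds that every in-flight request contributes to; the sample counter values are illustrative:

```python
# rate(diskio_weighted_io_time)[...] / 1000 converts ms accumulated per
# second into the average number of in-flight requests (queue depth).
def avg_backlog(counter_t0_ms, counter_t1_ms, interval_s):
    rate_ms_per_s = (counter_t1_ms - counter_t0_ms) / interval_s
    return rate_ms_per_s / 1000  # ms/s -> average requests in flight

# Illustrative counter samples taken 60 s apart.
print(avg_backlog(1_000_000, 1_720_000, 60))  # 12.0 -> above the threshold of 10
```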

SystemDiskBacklogCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

Disk {{ $labels.name }} backlog critical.

Raise condition

rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 20

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 20 during the last 10 minutes.

SystemDiskRequestQueuedWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

Disk {{ $labels.name }} requests were queued for 90% of time.

Raise condition

rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.9

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 90% of the device time in queue during the last 10 minutes.

SystemDiskRequestQueuedCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

Disk {{ $labels.name }} requests were queued for 98% of time.

Raise condition

rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.98

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 98% of the device time in queue during the last 10 minutes.

SystemMemoryFullWarning

Severity

Warning

Summary

The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.

Raise condition

mem_used_percent > 90 and mem_available < 8 * 2^30

Description

Raises when a node uses more than 90% of RAM and less than 8 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.
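In PromQL, ^ is the power operator, so 8 * 2^30 is 8 GiB expressed in bytes. A quick Python mirror of the two clauses of the raise condition (the sample values are illustrative):

```python
GIB = 2**30  # PromQL 2^30; mem_available is reported in bytes

def memory_full_warning(mem_used_percent, mem_available_bytes):
    # Both clauses of the raise condition must hold.
    return mem_used_percent > 90 and mem_available_bytes < 8 * GIB

print(8 * GIB)                            # 8589934592 bytes = 8 GiB
print(memory_full_warning(92, 4 * GIB))   # True: >90 % used and <8 GiB free
print(memory_full_warning(92, 16 * GIB))  # False: enough memory is available
```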

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the free and used RAM on the affected node using free -h.

  • Identify the service that consumes RAM.

Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullWarning:
              if: >-
                mem_used_percent > 80 and mem_available < 8 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemMemoryFullMajor

Severity

Major

Summary

The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.

Raise condition

mem_used_percent > 95 and mem_available < 4 * 2^30

Description

Raises when a node uses more than 95% of RAM and less than 4 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the free and used RAM on the affected node using free -h.

  • Identify the service that consumes RAM.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullMajor:
              if: >-
                mem_used_percent > 90 and mem_available < 4 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSwapFullWarning

Removed since the 2019.2.4 maintenance update

Severity

Warning

Summary

The swap on the {{ $labels.host }} node is more than 50% used for 2 minutes.

Raise condition

swap_used_percent >= 50.0

Description

Raises when the swap on a node is 50% used, indicating that the node is under a high load or out of RAM. The host label in the raised alert contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting

  • Verify the free and used RAM and swap on the affected node using free -h.

  • Identify the service that consumes RAM and swap.

Tuning

Disable the alert as described in Manage alerts.

SystemSwapFullMinor

Removed since the 2019.2.4 maintenance update

Severity

Minor

Summary

The swap on the {{ $labels.host }} node is more than 90% used for 2 minutes.

Raise condition

swap_used_percent >= 90.0

Description

Raises when the swap on a node is 90% used, indicating that the node is under a high load or out of RAM. The host label contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting

  • Verify the free and used RAM and swap on the affected node using free -h.

  • Identify the service that consumes RAM and swap.

Tuning

Disable the alert as described in Manage alerts.

SystemRxPacketsDroppedTooHigh

Severity

Warning

Summary

More than 60 packets received by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.

Raise condition

increase(net_drop_in[1m]) > 60 unless on (host,interface) bond_slave_active == 0

Description

Raises when the number of dropped RX packets on an interface (except the bond slave interfaces that are in the BACKUP state) is higher than 60 for the last minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node. The reasons can be as follows:

  • Full softnet backlog

  • Wrong VLAN tags, packets received with unknown or unregistered protocols

  • IPv6 frames if the server is configured only for IPv4
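The unless on (host,interface) clause keeps an alert series only if no bond_slave_active == 0 series carries the same label pair. A set-difference sketch in Python, with hypothetical hosts and interfaces:

```python
# increase(net_drop_in[1m]) per (host, interface); values are illustrative.
drops = {("cmp1", "eth0"): 120, ("cmp1", "bond0.1"): 90}
# Series where bond_slave_active == 0 (slave interface in BACKUP state).
backup_slaves = {("cmp1", "bond0.1")}

# `A unless on (host,interface) B` discards every A series whose label
# pair also appears in B; the > 60 threshold filter applies to A first.
firing = {labels: value for labels, value in drops.items()
          if value > 60 and labels not in backup_slaves}
print(firing)  # {('cmp1', 'eth0'): 120}
```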

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_in[1m]) > 600 unless on (host,interface)
                bond_slave_active == 0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

More than 100 packets transmitted by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.

Raise condition

increase(net_drop_out[1m]) > 100

Description

Raises when the number of dropped TX packets on the interface is higher than 100 for the last 1 minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemRxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity

Warning

Summary

More than {{ net_rx_error_threshold }} received packets had errors during the last minute.

Raise condition

increase(net_err_in[1m]) > {{ net_rx_error_threshold }}

Description

Raises when the number of packets received with errors by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsErrorTooHigh:
              if: >-
                increase(net_err_in[1m]) > 600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity

Warning

Summary

More than {{ net_tx_error_threshold }} transmitted packets had errors during the last minute.

Raise condition

increase(net_err_out[1m]) > {{ net_tx_error_threshold }}

Description

Raises when the number of packets transmitted with errors by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsErrorTooHigh:
              if: >-
                increase(net_err_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CronProcessDown

Severity

Critical

Summary

The cron process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="cron"} == 0

Description

Raises when Telegraf cannot find running cron processes on a node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the cron service status using systemctl status cron.

  • Inspect the Telegraf logs using journalctl -u telegraf.

Tuning

Not required

SshdProcessDown

Severity

Critical

Summary

The SSH process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="sshd"} == 0

Description

Raises when Telegraf cannot find running sshd processes on a node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the sshd service status using systemctl status sshd.

  • Inspect the Telegraf logs using journalctl -u telegraf.

Tuning

Not required

SshFailedLoginsTooHigh

Severity

Warning

Summary

More than 5 failed SSH login attempts on the {{ $labels.host }} node during the last 5 minutes.

Raise condition

increase(failed_logins_total[5m]) > 5

Description

Raises when more than 5 failed login messages were detected in the syslog (/var/log/syslog) during the last 5 minutes. Fluentd parses the syslog for the ^Invalid user regular expression and increases the counter on each match. The host label in the raised alert contains the name of the affected node.
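Because the pattern is anchored with ^, only messages that begin with the phrase are counted. A quick Python check of that behavior (the log lines are illustrative):

```python
import re

# The Fluentd pattern from the description above; ^ anchors it to the
# start of the parsed message, so the phrase must open the line.
FAILED_LOGIN = re.compile(r"^Invalid user")

print(bool(FAILED_LOGIN.search("Invalid user admin from 203.0.113.5")))  # True
print(bool(FAILED_LOGIN.search("sshd[42]: Invalid user admin")))         # False: not at start
```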

Troubleshooting

Inspect the syslog journal for Invalid user words using grep.

Tuning

For example, to change the threshold to 50 failed login attempts per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SshFailedLoginsTooHigh:
              if: >-
                increase(failed_logins_total[1h]) > 50
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PacketsDroppedByCpuWarning

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.

Raise condition

floor(increase(nstat_packet_drop[10m])) > 0

Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 0 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.
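Column 2 of /proc/net/softnet_stat can be inspected directly; note that the file's values are hexadecimal. A Python sketch with an illustrative line:

```python
# One line of /proc/net/softnet_stat per CPU; all values are hexadecimal.
# Column 1 is processed packets, column 2 is drops due to a full backlog.
sample_line = ("0008e8f1 00000003 00000011 00000000 00000000 00000000"
               " 00000000 00000000 00000000 00000000 00000000")

fields = sample_line.split()
processed = int(fields[0], 16)
dropped = int(fields[1], 16)  # what nstat_packet_drop is built from

print(dropped)  # 3 -> a non-zero increase over 10 minutes fires the alert
```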

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 100 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuWarning:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 100
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PacketsDroppedByCpuMinor

Severity

Minor

Summary

The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.

Raise condition

floor(increase(nstat_packet_drop[10m])) > 100

Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 100 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 500 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuMinor:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NetdevBudgetRanOutsWarning

Severity

Warning

Summary

The rate of net_rx_action loop terminations on the {{ $labels.host }} node is {{ $value }} per second during the last 5 minutes.

Raise condition

max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.1

Description

Raises when the average rate of net_rx_action loop terminations is greater than 0.1 per second for the last 5 minutes, according to the data in column 3 of the /proc/net/softnet_stat file. The net_rx_action loop processes packets from the memory region to which the device transferred them through Direct Memory Access (DMA). The loop terminates early when it runs out of budget (the number of packets it may process) or time. Such terminations indicate that the CPU does not have enough quota (budget or time) to process all associated packets, or that some drivers encounter system issues. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the number of packets dropped by the CPU.

  • If no packets are dropped, increase the budget or the time interval to resolve the issue:

    • Increase the budget by modifying the net.core.netdev_budget kernel parameter. For example:

      sudo sysctl -w net.core.netdev_budget=600
      

      For details, see kernel documentation.

    • Increase the time interval by modifying the net.core.netdev_budget_usecs kernel parameter. For example:

      sudo sysctl -w net.core.netdev_budget_usecs=5000
      

      For details, see kernel documentation.
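The raise condition above applies the PromQL rate() function to the cumulative time_squeeze counter. The same arithmetic, (new_count - old_count) / interval, can be sketched in shell from two samples of column 3 of /proc/net/softnet_stat; the sample values below are assumptions for illustration:

```shell
# A minimal sketch of the per-second rate behind the alert:
# (later count - earlier count) / seconds between the two samples.
squeeze_rate() {
  # $1 = earlier time_squeeze count, $2 = later count, $3 = interval in seconds
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.3f\n", (b - a) / t }'
}
squeeze_rate 120 150 300   # 30 squeezes over 5 minutes
```

A result above 0.1 would satisfy the default raise condition.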

Tuning

For example, to change the threshold to 0.2:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NetdevBudgetRanOutsWarning:
              if: >-
                max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.2
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemCpuIoWaitWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

The CPU waited for I/O more than 40% of the time.

Raise condition

cpu_usage_iowait > 40

Description

The CPU on the {{ $labels.host }} node spent more than 40% of its time waiting for I/O.
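The cpu_usage_iowait metric is a percentage. The underlying ratio, 100 times the change in the iowait counter divided by the change in total jiffies between two samples of /proc/stat, can be sketched as follows (the sample values are assumptions):

```shell
# Hedged sketch: iowait percentage between two samples of /proc/stat
# counters: 100 * delta(iowait) / delta(total jiffies).
iowait_pct() {
  # $1/$2 = iowait and total jiffies at t0; $3/$4 = the same at t1
  awk -v i0="$1" -v t0="$2" -v i1="$3" -v t1="$4" \
    'BEGIN { printf "%.1f\n", 100 * (i1 - i0) / (t1 - t0) }'
}
iowait_pct 100 1000 500 2000   # 400 of 1000 elapsed jiffies spent in iowait
```

A result above 40 corresponds to the warning threshold; above 50, to the critical one.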

SystemCpuIoWaitCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

The CPU waited for I/O more than 50% of the time.

Raise condition

cpu_usage_iowait > 50

Description

The CPU on the {{ $labels.host }} node spent more than 50% of its time waiting for I/O.

SystemCpuStealTimeWarning

Available starting from the 2019.2.6 maintenance update

Severity

Warning

Summary

The CPU steal time was above 5.0% on the {{ $labels.host }} node for 5 minutes.

Raise condition

cpu_usage_steal > 5.0

Description

Raises when a VM vCPU spends more than 5% of its time waiting for a real CPU for the last 5 minutes, which typically occurs under high load when physical CPU resources are insufficient. Waiting for resources slows down the processes running in the VM.

Warning

For production environments, configure the alert after deployment.
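To inspect the raw counter behind this metric on a node, note that steal time is field 9 of the aggregate "cpu" line in /proc/stat (after user, nice, system, idle, iowait, irq, and softirq). A minimal sketch, with an assumed sample line standing in for the real file:

```shell
# Extract the steal-time counter (field 9) from a /proc/stat "cpu" line.
# The sample line is an assumption; on a node, read the first line of
# /proc/stat instead.
sample='cpu  4705 356 584 3699176 23060 123 456 789 0 0'
echo "$sample" | awk '{ print "steal jiffies:", $9 }'
```

A steadily growing steal counter between samples is what the cpu_usage_steal percentage reflects.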

Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeWarning:
              if: >-
                cpu_usage_steal > 2.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemCpuStealTimeCritical

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The CPU steal time was above 10.0% on the {{ $labels.host }} node for 5 minutes.

Raise condition

cpu_usage_steal > 10.0

Description

Raises when a VM vCPU spends more than 10% of its time waiting for a real CPU for the last 5 minutes, which typically occurs under high load when physical CPU resources are insufficient. Waiting for resources slows down the processes running in the VM.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 5%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeCritical:
              if: >-
                cpu_usage_steal > 5.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.