System

This section describes the system alerts.


SystemCpuFullWarning

Severity

Warning

Summary

The average CPU usage on the {{ $labels.host }} node is more than 90% for 2 minutes.

Raise condition

100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 90

Description

Raises when the average CPU idle time on a node is less than 10% for the last 5 minutes, indicating that the node is under load. The host label in the raised alert contains the name of the affected node.
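The condition works on idle time rather than busy time. A minimal Python sketch of the arithmetic, using illustrative idle-time samples (not real telemetry):

```python
# If the average cpu_usage_idle over the window is below 10 %, the busy
# CPU time (100 - idle) is above 90 % and the alert fires.
idle_samples = [7.0, 9.5, 8.5, 6.0, 9.0]  # illustrative avg_over_time inputs

avg_idle = sum(idle_samples) / len(idle_samples)
busy = 100 - avg_idle

print(busy)       # 92.0
print(busy > 90)  # True -> SystemCpuFullWarning would fire
```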

Troubleshooting

Inspect the output of the top command on the affected node.

Tuning

For example, to change the threshold to 20%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuFullWarning:
              if: >-
                100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 80
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighWarning

Severity

Warning

Summary

The system load per CPU on the {{ $labels.host }} node is more than 1 for 5 minutes.

Raise condition

  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 1

  • In 2019.2.8: system_load15 / system_n_cpus > 1

  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}

Description

Raises when the average load on the node is higher than 1 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.

Tuning

For example, to change the threshold to 1.5 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighWarning:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 1.5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemLoadTooHighCritical

Severity

Critical. Prior to 2019.2.8, Warning.

Summary

The system load per CPU on the {{ $labels.host }} node is more than 2 for 5 minutes.

Raise condition

  • In 2019.2.7 and prior: system_load5 / system_n_cpus > 2

  • In 2019.2.8: system_load15 / system_n_cpus > 2

  • In 2019.2.9 and newer: system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > {{ load_threshold }}

Description

Raises when the average load on the node is higher than 2 per CPU core over the last 5 minutes, indicating that the system is overloaded and many processes are waiting for CPU time. The host label in the raised alert contains the name of the affected node.

Troubleshooting

Inspect the output of the uptime and top commands on the affected node.

Tuning

For example, to change the threshold to 3 per core:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter depending on the raise conditions provided above. For example:

    parameters:
      prometheus:
        server:
          alert:
            SystemLoadTooHighCritical:
              if: >-
                system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 3
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskFullWarning

Severity

Warning

Summary

The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is more than 85% full for 2 minutes.

Raise condition

disk_used_percent >= 85

Description

Raises when the disk partition on a node is 85% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free disk space on the node using the df command.

  • Increase the disk space on the affected node or remove unused data.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullWarning:
              if: >-
                disk_used_percent >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskFullMajor

Severity

Major

Summary

The disk partition ({{ $labels.path }}) on the {{ $labels.host }} node is 95% full for 2 minutes.

Raise condition

disk_used_percent >= 95

Description

Raises when the disk partition on a node is 95% full. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free disk space on the node using the df command.

  • Increase the disk space on the affected node or remove unused data.

Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskFullMajor:
              if: >-
                disk_used_percent >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullWarning

Severity

Warning

Summary

The {{ $labels.host }} node uses more than 85% of disk inodes in the {{ $labels.path }} volume for 2 minutes.

Raise condition

100 * disk_inodes_used / disk_inodes_total >= 85.0

Description

Raises when the usage of inodes of a disk partition on the node is 85%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free inodes on the affected node using the df -i command.

  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullWarning:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskInodesFullMajor

Severity

Major

Summary

The {{ $labels.host }} node uses more than 95% of disk inodes in the {{ $labels.path }} volume for 2 minutes.

Raise condition

100 * disk_inodes_used / disk_inodes_total >= 95.0

Description

Raises when the usage of inodes of a disk partition on the node is 95%. The host, device, and path labels in the raised alert contain the name of the affected node, device, and the path to the mount point.

Troubleshooting

  • Verify the used and free inodes on the affected node using the df -i command.

  • If the disk is not full on the affected node, identify the reason for the inodes leak or remove unused files.

Tuning

For example, to change the threshold to 99%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskInodesFullMajor:
              if: >-
                100 * disk_inodes_used / disk_inodes_total >= 99
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting errors for 5 minutes.

Raise condition

increase(hdd_errors_total[1m]) > 0

Description

Raises when disk error messages were detected in the syslog (/var/log/syslog) on a host every minute for the last 5 minutes. Fluentd parses the syslog for the error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.* and \b[sv]d[a-z]{1,2}\d{0,3}\b.*error regular expressions and increases the counter on each match. The host and device labels in the raised alert contain the name of the affected node and the affected device.
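The two patterns can be exercised directly. A small Python sketch, using illustrative syslog fragments rather than real log output:

```python
import re

# The two regular expressions Fluentd applies to /var/log/syslog,
# copied from the description above.
PATTERNS = [
    re.compile(r"error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.*"),
    re.compile(r"\b[sv]d[a-z]{1,2}\d{0,3}\b.*error"),
]

def counts_as_disk_error(line):
    """True if the line would increment hdd_errors_total."""
    return any(p.search(line) for p in PATTERNS)

# Illustrative syslog fragments, not real log output.
print(counts_as_disk_error("kernel: blk_update_request: I/O error, dev sda1"))  # True
print(counts_as_disk_error("CRON[1234]: session opened for user root"))         # False
```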

Troubleshooting

Inspect the syslog journal for error words using grep.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemDiskErrorsTooHigh:
              if: >-
                increase(hdd_errors_total[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemDiskBacklogWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

Disk {{ $labels.name }} backlog warning.

Raise condition

rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 10

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 10 during the last 10 minutes.
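The arithmetic behind the condition can be sketched as follows, assuming (as Telegraf reports it) that diskio_weighted_io_time is a counter in milliseconds that every in-flight request contributes to; the sample counter values are illustrative:

```python
# rate(diskio_weighted_io_time)[...] / 1000 converts ms accumulated per
# second into the average number of in-flight requests (queue depth).
def avg_backlog(counter_t0_ms, counter_t1_ms, interval_s):
    rate_ms_per_s = (counter_t1_ms - counter_t0_ms) / interval_s
    return rate_ms_per_s / 1000  # ms/s -> average requests in flight

# Illustrative counter samples taken 60 s apart.
print(avg_backlog(1_000_000, 1_720_000, 60))  # 12.0 -> above the threshold of 10
```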

SystemDiskBacklogCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

Disk {{ $labels.name }} backlog critical.

Raise condition

rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 20

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node exceeded the concurrency level of 20 during the last 10 minutes.

SystemDiskRequestQueuedWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

Disk {{ $labels.name }} requests were queued for 90% of time.

Raise condition

rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.9

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 90% of the device time in queue during the last 10 minutes.

SystemDiskRequestQueuedCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

Disk {{ $labels.name }} requests were queued for 98% of time.

Raise condition

rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.98

Description

I/O requests for the {{ $labels.name }} disk on the {{ $labels.host }} node spent 98% of the device time in queue during the last 10 minutes.

SystemMemoryFullWarning

Severity

Warning

Summary

The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.

Raise condition

mem_used_percent > 90 and mem_available < 8 * 2^30

Description

Raises when a node uses more than 90% of RAM and less than 8 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.
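In PromQL, ^ is the power operator, so 8 * 2^30 is 8 GiB expressed in bytes. A quick Python mirror of the two clauses of the raise condition (the sample values are illustrative):

```python
GIB = 2**30  # PromQL 2^30; mem_available is reported in bytes

def memory_full_warning(mem_used_percent, mem_available_bytes):
    # Both clauses of the raise condition must hold.
    return mem_used_percent > 90 and mem_available_bytes < 8 * GIB

print(8 * GIB)                            # 8589934592 bytes = 8 GiB
print(memory_full_warning(92, 4 * GIB))   # True: >90 % used and <8 GiB free
print(memory_full_warning(92, 16 * GIB))  # False: enough memory is available
```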

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the free and used RAM on the affected node using free -h.

  • Identify the service that consumes RAM.

Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullWarning:
              if: >-
                mem_used_percent > 80 and mem_available < 8 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemMemoryFullMajor

Severity

Major

Summary

The {{ $labels.host }} node uses {{ $value }}% of memory for 2 minutes.

Raise condition

mem_used_percent > 95 and mem_available < 4 * 2^30

Description

Raises when a node uses more than 95% of RAM and less than 4 GB of memory is available, indicating that the node is under a high load. The host label in the raised alert contains the name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the free and used RAM on the affected node using free -h.

  • Identify the service that consumes RAM.

Tuning

For example, to change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemMemoryFullMajor:
              if: >-
                mem_used_percent > 90 and mem_available < 4 * 2^30
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSwapFullWarning

Removed since the 2019.2.4 maintenance update

Severity

Warning

Summary

The swap on the {{ $labels.host }} node is more than 50% used for 2 minutes.

Raise condition

swap_used_percent >= 50.0

Description

Raises when the swap on a node is 50% used, indicating that the node is under a high load or out of RAM. The host label in the raised alert contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting

  • Verify the free and used RAM and swap on the affected node using free -h.

  • Identify the service that consumes RAM and swap.

Tuning

Disable the alert as described in Manage alerts.

SystemSwapFullMinor

Removed since the 2019.2.4 maintenance update

Severity

Minor

Summary

The swap on the {{ $labels.host }} node is more than 90% used for 2 minutes.

Raise condition

swap_used_percent >= 90.0

Description

Raises when the swap on a node is 90% used, indicating that the node is under a high load or out of RAM. The host label contains the name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting

  • Verify the free and used RAM and swap on the affected node using free -h.

  • Identify the service that consumes RAM and swap.

Tuning

Disable the alert as described in Manage alerts.

SystemRxPacketsDroppedTooHigh

Severity

Warning

Summary

More than 60 packets received by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.

Raise condition

increase(net_drop_in[1m]) > 60 unless on (host,interface) bond_slave_active == 0

Description

Raises when the number of dropped RX packets on an interface (except the bond slave interfaces that are in the BACKUP state) is higher than 60 for the last minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node. The reasons can be as follows:

  • Full softnet backlog

  • Wrong VLAN tags, packets received with unknown or unregistered protocols

  • IPv6 frames if the server is configured only for IPv4
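The unless on (host,interface) clause keeps an alert series only if no bond_slave_active == 0 series carries the same label pair. A set-difference sketch in Python, with hypothetical hosts and interfaces:

```python
# increase(net_drop_in[1m]) per (host, interface); values are illustrative.
drops = {("cmp1", "eth0"): 120, ("cmp1", "bond0.1"): 90}
# Series where bond_slave_active == 0 (slave interface in BACKUP state).
backup_slaves = {("cmp1", "bond0.1")}

# `A unless on (host,interface) B` discards every A series whose label
# pair also appears in B; the > 60 threshold filter applies to A first.
firing = {labels: value for labels, value in drops.items()
          if value > 60 and labels not in backup_slaves}
print(firing)  # {('cmp1', 'eth0'): 120}
```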

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_in[1m]) > 600 unless on (host,interface)
                bond_slave_active == 0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

More than 100 packets transmitted by the {{ $labels.interface }} interface on the {{ $labels.host }} node were dropped during the last minute.

Raise condition

increase(net_drop_out[1m]) > 100

Description

Raises when the number of dropped TX packets on the interface is higher than 100 for the last 1 minute, according to the data from /proc/net/dev of the affected node. The host and interface labels in the raised alert contain the name of the affected node and the affected interface on that node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsDroppedTooHigh:
              if: >-
                increase(net_drop_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemRxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity

Warning

Summary

More than {{ net_rx_error_threshold }} received packets had errors during the last minute.

Raise condition

increase(net_err_in[1m]) > {{ net_rx_error_threshold }}

Description

Raises when the number of packets received with errors by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 600 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemRxPacketsErrorTooHigh:
              if: >-
                increase(net_err_in[1m]) > 600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemTxPacketsErrorTooHigh

Available starting from the 2019.2.15 maintenance update

Severity

Warning

Summary

More than {{ net_tx_error_threshold }} transmitted packets had errors during the last minute.

Raise condition

increase(net_err_out[1m]) > {{ net_tx_error_threshold }}

Description

Raises when the number of packets transmitted with errors by an interface on a node exceeds the threshold during the last minute.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Inspect the output of the ip -s a command.

Tuning

For example, to change the threshold to 1000 packets per 1 minute:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemTxPacketsErrorTooHigh:
              if: >-
                increase(net_err_out[1m]) > 1000
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CronProcessDown

Severity

Critical

Summary

The cron process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="cron"} == 0

Description

Raises when Telegraf cannot find running cron processes on a node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the cron service status using systemctl status cron.

  • Inspect the Telegraf logs using journalctl -u telegraf.

Tuning

Not required

SshdProcessDown

Severity

Critical

Summary

The SSH process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="sshd"} == 0

Description

Raises when Telegraf cannot find running sshd processes on a node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the sshd service status using systemctl status sshd.

  • Inspect the Telegraf logs using journalctl -u telegraf.

Tuning

Not required

SshFailedLoginsTooHigh

Severity

Warning

Summary

More than 5 failed SSH login attempts on the {{ $labels.host }} node during the last 5 minutes.

Raise condition

increase(failed_logins_total[5m]) > 5

Description

Raises when more than 5 failed login messages were detected in the syslog (/var/log/syslog) during the last 5 minutes. Fluentd parses the syslog for the ^Invalid user regular expression and increases the counter on each match. The host label in the raised alert contains the name of the affected node.
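Because the pattern is anchored with ^, only messages that begin with the phrase are counted. A quick Python check of that behavior (the log lines are illustrative):

```python
import re

# The Fluentd pattern from the description above; ^ anchors it to the
# start of the parsed message, so the phrase must open the line.
FAILED_LOGIN = re.compile(r"^Invalid user")

print(bool(FAILED_LOGIN.search("Invalid user admin from 203.0.113.5")))  # True
print(bool(FAILED_LOGIN.search("sshd[42]: Invalid user admin")))         # False: not at start
```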

Troubleshooting

Inspect the syslog journal for Invalid user words using grep.

Tuning

For example, to change the threshold to 50 failed login attempts per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SshFailedLoginsTooHigh:
              if: >-
                increase(failed_logins_total[1h]) > 50
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PacketsDroppedByCpuWarning

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.

Raise condition

floor(increase(nstat_packet_drop[10m])) > 0

Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 0 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.
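Column 2 of /proc/net/softnet_stat can be inspected directly; note that the file's values are hexadecimal. A Python sketch with an illustrative line:

```python
# One line of /proc/net/softnet_stat per CPU; all values are hexadecimal.
# Column 1 is processed packets, column 2 is drops due to a full backlog.
sample_line = ("0008e8f1 00000003 00000011 00000000 00000000 00000000"
               " 00000000 00000000 00000000 00000000 00000000")

fields = sample_line.split()
processed = int(fields[0], 16)
dropped = int(fields[1], 16)  # what nstat_packet_drop is built from

print(dropped)  # 3 -> a non-zero increase over 10 minutes fires the alert
```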

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 100 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuWarning:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 100
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PacketsDroppedByCpuMinor

Severity

Minor

Summary

The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped {{ $value }} packets during the last 10 minutes.

Raise condition

floor(increase(nstat_packet_drop[10m])) > 100

Description

Raises when the number of packets dropped by CPU due to the lack of space in the processing queue is more than 100 for the last 10 minutes, according to the data in column 2 of the /proc/net/softnet_stat file. CPU starts to drop associated packets when its queue (backlog) is full because the interface receives packets faster than the kernel can process them. The host and cpu labels in the raised alert contain the name of the affected node and the CPU.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

Increase the backlog size by modifying the net.core.netdev_max_backlog kernel parameter. For example:

sudo sysctl -w net.core.netdev_max_backlog=3000

For details, see kernel documentation.

Tuning

For example, to change the threshold to 500 packets per 1 hour:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PacketsDroppedByCpuMinor:
              if: >-
                floor(increase(nstat_packet_drop[1h])) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

NetdevBudgetRanOutsWarning

Severity

Warning

Summary

The rate of net_rx_action loop terminations on the {{ $labels.host }} node is {{ $value }} per second during the last 5 minutes.

Raise condition

max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.1

Description

Raises when the average rate of net_rx_action loop terminations is greater than 0.1 per second for the last 5 minutes, according to the data in column 3 of the /proc/net/softnet_stat file. The net_rx_action loop processes packets from the memory region to which the device transferred them through Direct Memory Access (DMA). The loop terminates early when it runs out of budget (the number of packets it may process) or time. Such terminations indicate that the CPU does not have enough quota (budget or time) to process all associated packets, or that some drivers encounter system issues. The host label in the raised alert contains the host name of the affected node.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Verify the number of packets dropped by the CPU.

  • If no packets are dropped, increase the budget or the time interval to resolve the issue:

    • Increase the budget by modifying the net.core.netdev_budget kernel parameter. For example:

      sudo sysctl -w net.core.netdev_budget=600
      

      For details, see kernel documentation.

    • Increase the time interval by modifying the net.core.netdev_budget_usecs kernel parameter. For example:

      sudo sysctl -w net.core.netdev_budget_usecs=5000
      

      For details, see kernel documentation.
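The raise condition above applies the PromQL rate() function to the cumulative time_squeeze counter. The same arithmetic, (new_count - old_count) / interval, can be sketched in shell from two samples of column 3 of /proc/net/softnet_stat; the sample values below are assumptions for illustration:

```shell
# A minimal sketch of the per-second rate behind the alert:
# (later count - earlier count) / seconds between the two samples.
squeeze_rate() {
  # $1 = earlier time_squeeze count, $2 = later count, $3 = interval in seconds
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.3f\n", (b - a) / t }'
}
squeeze_rate 120 150 300   # 30 squeezes over 5 minutes
```

A result above 0.1 would satisfy the default raise condition.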

Tuning

For example, to change the threshold to 0.2:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NetdevBudgetRanOutsWarning:
              if: >-
                max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.2
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemCpuIoWaitWarning

Available starting from the 2019.2.14 maintenance update

Severity

Warning

Summary

The CPU waited for I/O more than 40% of the time.

Raise condition

cpu_usage_iowait > 40

Description

The CPU on the {{ $labels.host }} node spent more than 40% of its time waiting for I/O.
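The cpu_usage_iowait metric is a percentage. The underlying ratio, 100 times the change in the iowait counter divided by the change in total jiffies between two samples of /proc/stat, can be sketched as follows (the sample values are assumptions):

```shell
# Hedged sketch: iowait percentage between two samples of /proc/stat
# counters: 100 * delta(iowait) / delta(total jiffies).
iowait_pct() {
  # $1/$2 = iowait and total jiffies at t0; $3/$4 = the same at t1
  awk -v i0="$1" -v t0="$2" -v i1="$3" -v t1="$4" \
    'BEGIN { printf "%.1f\n", 100 * (i1 - i0) / (t1 - t0) }'
}
iowait_pct 100 1000 500 2000   # 400 of 1000 elapsed jiffies spent in iowait
```

A result above 40 corresponds to the warning threshold; above 50, to the critical one.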

SystemCpuIoWaitCritical

Available starting from the 2019.2.14 maintenance update

Severity

Critical

Summary

The CPU waited for I/O more than 50% of the time.

Raise condition

cpu_usage_iowait > 50

Description

The CPU on the {{ $labels.host }} node spent more than 50% of its time waiting for I/O.

SystemCpuStealTimeWarning

Available starting from the 2019.2.6 maintenance update

Severity

Warning

Summary

The CPU steal time was above 5.0% on the {{ $labels.host }} node for 5 minutes.

Raise condition

cpu_usage_steal > 5.0

Description

Raises when a VM vCPU spends more than 5% of its time waiting for a real CPU for the last 5 minutes, which typically occurs under high load when physical CPU resources are insufficient. Waiting for resources slows down the processes running in the VM.

Warning

For production environments, configure the alert after deployment.
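To inspect the raw counter behind this metric on a node, note that steal time is field 9 of the aggregate "cpu" line in /proc/stat (after user, nice, system, idle, iowait, irq, and softirq). A minimal sketch, with an assumed sample line standing in for the real file:

```shell
# Extract the steal-time counter (field 9) from a /proc/stat "cpu" line.
# The sample line is an assumption; on a node, read the first line of
# /proc/stat instead.
sample='cpu  4705 356 584 3699176 23060 123 456 789 0 0'
echo "$sample" | awk '{ print "steal jiffies:", $9 }'
```

A steadily growing steal counter between samples is what the cpu_usage_steal percentage reflects.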

Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeWarning:
              if: >-
                cpu_usage_steal > 2.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemCpuStealTimeCritical

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The CPU steal time was above 10.0% on the {{ $labels.host }} node for 5 minutes.

Raise condition

cpu_usage_steal > 10.0

Description

Raises when a VM vCPU spends more than 10% of its time waiting for a real CPU for the last 5 minutes, which typically occurs under high load when physical CPU resources are insufficient. Waiting for resources slows down the processes running in the VM.

Warning

For production environments, configure the alert after deployment.

Tuning

For example, to change the threshold to 5%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if this file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemCpuStealTimeCritical:
              if: >-
                cpu_usage_steal > 5.0
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.