System
This section describes the system alerts.
SystemCpuFullWarning
Severity |
Warning |
Summary |
The average CPU usage on the {{ $labels.host }} node is more than
90% for 2 minutes. |
Raise condition |
100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 90 |
Description |
Raises when the average CPU idle time on a node is less than 10% for
the last 5 minutes, indicating that the node is under load. The
host label in the raised alert contains the name of the affected
node. |
Troubleshooting |
Inspect the output of the top command on the affected node. |
Tuning |
For example, to change the threshold to 20% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemCpuFullWarning:
if: >-
100 - avg_over_time(cpu_usage_idle{cpu="cpu-total"}[5m]) > 80
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
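For the troubleshooting step above, a one-shot snapshot of the busiest processes can be captured as follows. This is only an illustrative command, assuming a procps version of top that supports the -o sorting option:
# Single batch iteration sorted by CPU usage, first 15 lines of output
top -b -n 1 -o %CPU | head -n 15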
SystemLoadTooHighWarning
Severity |
Warning |
Summary |
The system load per CPU on the {{ $labels.host }} node is more than
1 for 5 minutes. |
Raise condition |
- In 2019.2.7 and prior:
system_load5 / system_n_cpus > 1
- In 2019.2.8:
system_load15 / system_n_cpus > 1
- In 2019.2.9 and newer:
system_load15{host!~".*cmp[0-9]+"} /
system_n_cpus > {{ load_threshold }}
|
Description |
Raises when the average load on the node is higher than 1 per CPU
core over the last 5 minutes, indicating that the system is overloaded and
many processes are waiting for CPU time. The host label in the
raised alert contains the name of the affected node. |
Troubleshooting |
Inspect the output of the uptime and top commands on the
affected node. |
Tuning |
For example, to change the threshold to 1.5 per core:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter depending on the raise conditions
provided above. For example:
parameters:
prometheus:
server:
alert:
SystemLoadTooHighWarning:
if: >-
system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 1.5
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
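To relate the load average to the number of CPU cores manually (the same ratio this alert computes), the following illustrative commands can be used on the affected node:
# 1-, 5-, and 15-minute load averages and the number of CPU cores
uptime
nproc
# Per-core 15-minute load, computed from /proc/loadavg (field 3)
awk -v cores="$(nproc)" '{printf "load15 per core: %.2f\n", $3/cores}' /proc/loadavg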
SystemLoadTooHighCritical
Severity |
Critical. Prior to the 2019.2.8 maintenance update, Warning. |
Summary |
The system load per CPU on the {{ $labels.host }} node is more than
2 for 5 minutes. |
Raise condition |
- In 2019.2.7 and prior:
system_load5 / system_n_cpus > 2
- In 2019.2.8:
system_load15 / system_n_cpus > 2
- In 2019.2.9 and newer:
system_load15{host!~".*cmp[0-9]+"} /
system_n_cpus > {{ load_threshold }}
|
Description |
Raises when the average load on the node is higher than 2 per CPU
core over the last 5 minutes, indicating that the system is overloaded and many
processes are waiting for CPU time. The host label in the raised
alert contains the name of the affected node. |
Troubleshooting |
Inspect the output of the uptime and top commands on the
affected node. |
Tuning |
For example, to change the threshold to 3 per core:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter depending on the raise conditions
provided above. For example:
parameters:
prometheus:
server:
alert:
SystemLoadTooHighCritical:
if: >-
system_load15{host!~".*cmp[0-9]+"} / system_n_cpus > 3
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemDiskFullWarning
Severity |
Warning |
Summary |
The disk partition ({{ $labels.path }} ) on the
{{ $labels.host }} node is more than 85% full for 2 minutes. |
Raise condition |
disk_used_percent >= 85 |
Description |
Raises when the disk partition on a node is 85% full. The host ,
device , and path labels in the raised alert contain the name of
the affected node, device, and the path to the mount point. |
Troubleshooting |
- Verify the used and free disk space on the node using the df command.
- Increase the disk space on the affected node or remove unused data.
|
Tuning |
For example, to change the threshold to 90% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemDiskFullWarning:
if: >-
disk_used_percent >= 90
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
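To follow up on the troubleshooting steps above, the commands below show the usage of the affected mount point and its largest directories. The /var path is only an example; substitute the path from the path label of the alert:
# Usage of the affected mount point
df -h /var
# Ten largest directories on that file system
sudo du -xh /var 2>/dev/null | sort -rh | head -n 10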
SystemDiskFullMajor
Severity |
Major |
Summary |
The disk partition ({{ $labels.path }} ) on the
{{ $labels.host }} node is 95% full for 2 minutes. |
Raise condition |
disk_used_percent >= 95 |
Description |
Raises when the disk partition on a node is 95% full. The host ,
device , and path labels in the raised alert contain the name of
the affected node, device, and the path to the mount point. |
Troubleshooting |
- Verify the used and free disk space on the node using the df command.
- Increase the disk space on the affected node or remove unused data.
|
Tuning |
For example, to change the threshold to 99% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemDiskFullMajor:
if: >-
disk_used_percent >= 99
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemDiskInodesFullWarning
Severity |
Warning |
Summary |
The {{ $labels.host }} node uses more than 85% of disk inodes in the
{{ $labels.path }} volume for 2 minutes. |
Raise condition |
100 * disk_inodes_used / disk_inodes_total >= 85.0 |
Description |
Raises when a disk partition on a node uses 85% of the available inodes.
The host , device , and path labels in the raised alert
contain the name of the affected node, device, and the path to the mount
point. |
Troubleshooting |
- Verify the used and free inodes on the affected node using the
df -i command.
- If the disk is not full on the affected node, identify the reason for
the inodes leak or remove unused files.
|
Tuning |
For example, to change the threshold to 90% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemDiskInodesFullWarning:
if: >-
100 * disk_inodes_used / disk_inodes_total >= 90
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
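A common cause of inode exhaustion is a directory that accumulates many small files. The following illustrative commands list inode usage per file system and give a rough count of files per first-level directory of the affected mount point; /var is only an example path:
# Inode usage per mounted file system
df -i
# File count per first-level directory under the affected mount point
sudo find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -rn | head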
SystemDiskInodesFullMajor
Severity |
Major |
Summary |
The {{ $labels.host }} node uses more than 95% of disk inodes in the
{{ $labels.path }} volume for 2 minutes. |
Raise condition |
100 * disk_inodes_used / disk_inodes_total >= 95.0 |
Description |
Raises when a disk partition on a node uses 95% of the available inodes.
The host , device , and path labels in the raised alert
contain the name of the affected node, device, and the path to the mount
point. |
Troubleshooting |
- Verify the used and free inodes on the affected node using the
df -i command.
- If the disk is not full on the affected node, identify the reason for
the inodes leak or remove unused files.
|
Tuning |
For example, to change the threshold to 99% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemDiskInodesFullMajor:
if: >-
100 * disk_inodes_used / disk_inodes_total >= 99
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemDiskErrorsTooHigh
Severity |
Warning |
Summary |
The {{ $labels.device }} disk on the {{ $labels.host }} node is
reporting errors for 5 minutes. |
Raise condition |
increase(hdd_errors_total[1m]) > 0 |
Description |
Raises when disk error messages were detected every minute for the last
5 minutes in the syslog (/var/log/syslog ) on a host. Fluentd parses
the syslog for the error.*\b[sv]d[a-z]{1,2}\d{0,3}\b.* and
\b[sv]d[a-z]{1,2}\d{0,3}\b.*error regular expressions and increments
the counter for every match. The host and device labels in the
raised alert contain the name of the affected node and the affected
device. |
Troubleshooting |
Inspect the syslog journal for error words using grep . |
Tuning |
For example, to change the threshold to 5 errors during 5 minutes:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemDiskErrorsTooHigh:
if: >-
increase(hdd_errors_total[5m]) > 5
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
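The grep check from the troubleshooting cell can be expressed, for example, with the same patterns that the alert description mentions. The \d class is replaced with [0-9] because GNU grep extended regular expressions do not support it:
grep -iE 'error.*\b[sv]d[a-z]{1,2}[0-9]{0,3}\b|\b[sv]d[a-z]{1,2}[0-9]{0,3}\b.*error' /var/log/syslog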
SystemDiskBacklogWarning
Available starting from the 2019.2.14 maintenance update
Severity |
Warning |
Summary |
Disk {{ $labels.name }} backlog warning. |
Raise condition |
rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 10 |
Description |
I/O requests for the {{ $labels.name }} disk on the
{{ $labels.host }} node exceeded the concurrency level of 10 during
the last 10 minutes. |
SystemDiskBacklogCritical
Available starting from the 2019.2.14 maintenance update
Severity |
Critical |
Summary |
Disk {{ $labels.name }} backlog critical. |
Raise condition |
rate(diskio_weighted_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 20 |
Description |
I/O requests for the {{ $labels.name }} disk on the
{{ $labels.host }} node exceeded the concurrency level of 20 during
the last 10 minutes. |
SystemDiskRequestQueuedWarning
Available starting from the 2019.2.14 maintenance update
Severity |
Warning |
Summary |
Disk {{ $labels.name }} requests were queued for 90% of time. |
Raise condition |
rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.9 |
Description |
I/O requests for the {{ $labels.name }} disk on the
{{ $labels.host }} node spent 90% of the device time in queue during
the last 10 minutes. |
SystemDiskRequestQueuedCritical
Available starting from the 2019.2.14 maintenance update
Severity |
Critical |
Summary |
Disk {{ $labels.name }} requests were queued for 98% of time. |
Raise condition |
rate(diskio_io_time{name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000 > 0.98 |
Description |
I/O requests for the {{ $labels.name }} disk on the
{{ $labels.host }} node spent 98% of the device time in queue during
the last 10 minutes. |
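The diskio_weighted_io_time counter produced by the Telegraf diskio input is reported in milliseconds, so rate(...) / 1000 approximates the average number of in-flight requests (queue depth). To inspect the per-disk backlog manually, a query such as the following can be run in the Prometheus web UI; the host filter is only an example:
rate(diskio_weighted_io_time{host="cmp001", name=~"(hd[a-z]?|sd[a-z]?|nvme[0-9]?[a-z]?[0-9]?)"}[1m]) / 1000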
SystemMemoryFullWarning
Severity |
Warning |
Summary |
The {{ $labels.host }} node uses {{ $value }}% of memory for 2
minutes. |
Raise condition |
mem_used_percent > 90 and mem_available < 8 * 2^30 |
Description |
Raises when a node uses more than 90% of RAM and less than 8 GB of
memory is available, indicating that the node is under a high load. The
host label in the raised alert contains the name of the affected
node.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
- Verify the free and used RAM on the affected node using
free -h .
- Identify the service that consumes RAM.
|
Tuning |
For example, to change the threshold to 80% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemMemoryFullWarning:
if: >-
mem_used_percent > 80 and mem_available < 8 * 2^30
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
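To identify the largest memory consumers mentioned in the troubleshooting cell, the following illustrative commands can be used on the affected node:
# Free and used RAM in human-readable units
free -h
# Ten processes with the highest resident memory usage
ps aux --sort=-%mem | head -n 11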
SystemMemoryFullMajor
Severity |
Major |
Summary |
The {{ $labels.host }} node uses {{ $value }}% of memory for 2
minutes. |
Raise condition |
mem_used_percent > 95 and mem_available < 4 * 2^30 |
Description |
Raises when a node uses more than 95% of RAM and less than 4 GB of
memory is available, indicating that the node is under a high load. The
host label in the raised alert contains the name of the affected
node.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
- Verify the free and used RAM on the affected node using
free -h .
- Identify the service that consumes RAM.
|
Tuning |
For example, to change the threshold to 90% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemMemoryFullMajor:
if: >-
mem_used_percent > 90 and mem_available < 4 * 2^30
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemSwapFullWarning
Removed since the 2019.2.4 maintenance update
Severity |
Warning |
Summary |
The swap on the {{ $labels.host }} node is more than 50% used for 2
minutes. |
Raise condition |
swap_used_percent >= 50.0 |
Description |
Raises when the swap on a node is 50% used, indicating that the node is
under a high load or out of RAM. The host label in the raised alert
contains the name of the affected node.
Warning
The alert has been removed starting from the 2019.2.4
maintenance update. For the existing MCP deployments, disable this
alert.
|
Troubleshooting |
- Verify the free and used RAM and swap on the affected node using
free -h .
- Identify the service that consumes RAM and swap.
|
Tuning |
Disable the alert as described in Manage alerts. |
SystemSwapFullMinor
Removed since the 2019.2.4 maintenance update
Severity |
Minor |
Summary |
The swap on the {{ $labels.host }} node is more than 90% used for 2
minutes. |
Raise condition |
swap_used_percent >= 90.0 |
Description |
Raises when the swap on a node is 90% used, indicating that the node is
under a high load or out of RAM. The host label contains the name of
the affected node.
Warning
The alert has been removed starting from the 2019.2.4
maintenance update. For the existing MCP deployments, disable this
alert.
|
Troubleshooting |
- Verify the free and used RAM and swap on the affected node using
free -h .
- Identify the service that consumes RAM and swap.
|
Tuning |
Disable the alert as described in Manage alerts. |
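The Manage alerts procedure is not reproduced here. Assuming that alert definitions support the enabled parameter described there, disabling both swap alerts follows the same customization pattern used throughout this section, for example:
parameters:
  prometheus:
    server:
      alert:
        # Assumption: enabled: false disables the alert definition,
        # as described in the Manage alerts procedure
        SystemSwapFullWarning:
          enabled: false
        SystemSwapFullMinor:
          enabled: false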
SystemRxPacketsDroppedTooHigh
Severity |
Warning |
Summary |
More than 60 packets received by the {{ $labels.interface }}
interface on the {{ $labels.host }} node were dropped during the
last minute. |
Raise condition |
increase(net_drop_in[1m]) > 60 unless on (host,interface)
bond_slave_active == 0 |
Description |
Raises when the number of dropped RX packets on an interface (except
the bond slave interfaces that are in the BACKUP state) is higher
than 60 for the last minute, according to the data from
/proc/net/dev of the affected node. The host and interface
labels in the raised alert contain the name of the affected node and
the affected interface on that node. The reasons can be as follows:
- Full
softnet backlog
- Wrong VLAN tags, packets received with unknown or unregistered
protocols
- IPv6 frames if the server is configured for IPv4 only
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Inspect the output of the ip -s a command. |
Tuning |
For example, to change the threshold to 600 packets per 1 minute:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemRxPacketsDroppedTooHigh:
if: >-
increase(net_drop_in[1m]) > 600 unless on (host,interface)
bond_slave_active == 0
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
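Besides ip -s a, which lists per-interface statistics including dropped packets, it can be useful to verify the current softnet backlog size mentioned in the description, for example:
# Per-interface RX/TX statistics, including dropped packets
ip -s a
# Current size of the per-CPU backlog queue
sysctl net.core.netdev_max_backlog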
SystemTxPacketsDroppedTooHigh
Severity |
Warning |
Summary |
More than 100 packets transmitted by the {{ $labels.interface }}
interface on the {{ $labels.host }} node were dropped during the
last minute. |
Raise condition |
increase(net_drop_out[1m]) > 100 |
Description |
Raises when the number of dropped TX packets on the interface is higher
than 100 for the last 1 minute, according to the data from
/proc/net/dev of the affected node. The host and interface
labels in the raised alert contain the name of the affected node and
the affected interface on that node.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Inspect the output of the ip -s a command. |
Tuning |
For example, to change the threshold to 1000 packets per 1 minute:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemTxPacketsDroppedTooHigh:
if: >-
increase(net_drop_out[1m]) > 1000
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemRxPacketsErrorTooHigh
Available starting from the 2019.2.15 maintenance update
Severity |
Warning |
Summary |
More than {{ net_rx_error_threshold }} received packets had errors. |
Raise condition |
increase(net_err_in[1m]) > {{ net_rx_error_threshold }} |
Description |
Raises when the number of packets received by an interface on a node had
errors during the last minute.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Inspect the output of the ip -s a command. |
Tuning |
For example, to change the threshold to 600 packets per 1 minute:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemRxPacketsErrorTooHigh:
if: >-
increase(net_err_in[1m]) > 600
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemTxPacketsErrorTooHigh
Available starting from the 2019.2.15 maintenance update
Severity |
Warning |
Summary |
More than {{ net_tx_error_threshold }} transmitted packets had errors. |
Raise condition |
increase(net_err_out[1m]) > {{ net_tx_error_threshold }} |
Description |
Raises when the number of packets transmitted by an interface on a node
had errors during the last minute.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Inspect the output of the ip -s a command. |
Tuning |
For example, to change the threshold to 1000 packets per 1 minute:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemTxPacketsErrorTooHigh:
if: >-
increase(net_err_out[1m]) > 1000
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CronProcessDown
Severity |
Critical |
Summary |
The cron process on the {{ $labels.host }} node is down. |
Raise condition |
procstat_running{process_name="cron"} == 0 |
Description |
Raises when Telegraf cannot find running cron processes on a node.
The host label in the raised alert contains the host name of the
affected node. |
Troubleshooting |
- Verify the
cron service status using systemctl status cron .
- Inspect the Telegraf logs using
journalctl -u telegraf .
|
Tuning |
Not required |
SshdProcessDown
Severity |
Critical |
Summary |
The SSH process on the {{ $labels.host }} node is down. |
Raise condition |
procstat_running{process_name="sshd"} == 0 |
Description |
Raises when Telegraf cannot find running sshd processes on a node.
The host label in the raised alert contains the host name of the
affected node. |
Troubleshooting |
- Verify the
sshd service status using systemctl status sshd .
- Inspect the Telegraf logs using
journalctl -u telegraf .
|
Tuning |
Not required |
SshFailedLoginsTooHigh
Severity |
Warning |
Summary |
More than 5 failed SSH login attempts on the
{{ $labels.host }} node during the last 5 minutes. |
Raise condition |
increase(failed_logins_total[5m]) > 5 |
Description |
Raises when more than 5 failed login messages were detected in the
syslog (/var/log/syslog ) during the last 5 minutes. Fluentd parses the
syslog for the ^Invalid user regular expression and increments the
counter for every match. The host label in the raised alert
contains the name of the affected node. |
Troubleshooting |
Inspect the syslog journal for Invalid user words using
grep . |
Tuning |
For example, to change the threshold to 50 failed login attempts per 1 hour:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SshFailedLoginsTooHigh:
if: >-
increase(failed_logins_total[1h]) > 50
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
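The grep check from the troubleshooting cell can look as follows. The auth.log check is an additional suggestion and assumes a default Ubuntu rsyslog configuration:
# Messages that Fluentd matches for this alert
grep 'Invalid user' /var/log/syslog | tail -n 20
# Other SSH authentication failures recorded in the auth log
grep 'Failed password' /var/log/auth.log | tail -n 20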
PacketsDroppedByCpuWarning
Available starting from the 2019.2.3 maintenance update
Severity |
Warning |
Summary |
The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped
{{ $value }} packets during the last 10 minutes. |
Raise condition |
floor(increase(nstat_packet_drop[10m])) > 0 |
Description |
Raises when the number of packets dropped by CPU due to the lack of
space in the processing queue is more than 0 for the last 10 minutes,
according to the data in column 2 of the
/proc/net/softnet_stat file. CPU starts to drop associated packets
when its queue (backlog) is full because the interface receives packets
faster than the kernel can process them. The host and cpu labels
in the raised alert contain the name of the affected node and the CPU.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Increase the backlog size by modifying the
net.core.netdev_max_backlog kernel parameter. For example:
sudo sysctl -w net.core.netdev_max_backlog=3000
For details, see kernel documentation.
|
Tuning |
For example, to change the threshold to 100 packets per 1 hour:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
PacketsDroppedByCpuWarning:
if: >-
floor(increase(nstat_packet_drop[1h])) > 100
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
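To read the raw per-CPU drop counter that this alert is based on, inspect column 2 of /proc/net/softnet_stat; each row corresponds to one CPU and the values are hexadecimal. An illustrative command:
# Column 2 is the per-CPU dropped-packet counter (hexadecimal)
awk '{print "CPU" NR-1, "dropped (hex):", $2}' /proc/net/softnet_stat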
PacketsDroppedByCpuMinor
Severity |
Minor |
Summary |
The {{ $labels.cpu }} CPU on the {{ $labels.host }} node dropped
{{ $value }} packets during the last 10 minutes. |
Raise condition |
floor(increase(nstat_packet_drop[10m])) > 100 |
Description |
Raises when the number of packets dropped by CPU due to the lack of
space in the processing queue is more than 100 for the last 10 minutes,
according to the data in column 2 of the
/proc/net/softnet_stat file. CPU starts to drop associated packets
when its queue (backlog) is full because the interface receives packets
faster than the kernel can process them. The host and cpu labels
in the raised alert contain the name of the affected node and the CPU.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Increase the backlog size by modifying the
net.core.netdev_max_backlog kernel parameter. For example:
sudo sysctl -w net.core.netdev_max_backlog=3000
For details, see kernel documentation.
|
Tuning |
For example, to change the threshold to 500 packets per 1 hour:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
PacketsDroppedByCpuMinor:
if: >-
floor(increase(nstat_packet_drop[1h])) > 500
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
NetdevBudgetRanOutsWarning
Severity |
Warning |
Summary |
The rate of net_rx_action loops terminations on the
{{ $labels.host }} node is {{ $value }} per second during the
last 5 minutes. |
Raise condition |
max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.1 |
Description |
Raises when the average rate of the net_rx_action loop terminations
is greater than 0.1 per second for the last 5 minutes, according to the
data in column 3 of the /proc/net/softnet_stat file. The alert
typically indicates budget consumption or reaching of the time limit.
The net_rx_action loop starts processing the packets from the memory
to which the device transferred the packets through Direct Memory Access
(DMA). Running out of budget or time can cause loop termination.
Terminations of net_rx_action indicate that the CPU does not have
enough quota (budget or time) to process all pending packets, or that
some drivers encounter system issues.
The host label in the raised alert contains the host name of the
affected node.
Warning
For production environments, configure the alert after
deployment.
|
Troubleshooting |
Verify the number of packets dropped by the CPU.
If no dropped packets exist, increase the budget or time interval to
resolve the issue:
Increase the budget by modifying the
net.core.netdev_budget kernel parameter. For example:
sudo sysctl -w net.core.netdev_budget=600
For details, see kernel documentation.
Increase the time interval by modifying the
net.core.netdev_budget_usecs kernel parameter. For example:
sudo sysctl -w net.core.netdev_budget_usecs=5000
For details, see kernel documentation.
|
Tuning |
For example, to change the threshold to 0.2 :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
NetdevBudgetRanOutsWarning:
if: >-
max(rate(nstat_time_squeeze[5m])) without (cpu) > 0.2
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
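To inspect the raw time_squeeze counter and the current budget settings referenced above, the following illustrative commands can be used:
# Column 3 is the per-CPU time_squeeze counter (hexadecimal)
awk '{print "CPU" NR-1, "time_squeeze (hex):", $3}' /proc/net/softnet_stat
# Current budget and time-interval settings
sysctl net.core.netdev_budget net.core.netdev_budget_usecs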
SystemCpuIoWaitWarning
Available starting from the 2019.2.14 maintenance update
Severity |
Warning |
Summary |
CPU waited for I/O 40% of the time. |
Raise condition |
cpu_usage_iowait > 40 |
Description |
The CPU on the {{ $labels.host }} node spent 40% of the time waiting for
I/O. |
SystemCpuIoWaitCritical
Available starting from the 2019.2.14 maintenance update
Severity |
Critical |
Summary |
CPU waited for I/O 50% of the time. |
Raise condition |
cpu_usage_iowait > 50 |
Description |
The CPU on the {{ $labels.host }} node spent 50% of time waiting for
I/O. |
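These two alerts have no Tuning cell, but their thresholds can be overridden with the same customization pattern used throughout this section. A sketch, assuming the default alert names, to raise the warning threshold to 60%:
parameters:
  prometheus:
    server:
      alert:
        SystemCpuIoWaitWarning:
          if: >-
            cpu_usage_iowait > 60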
SystemCpuStealTimeWarning
Available starting from the 2019.2.6 maintenance update
Severity |
Warning |
Summary |
The CPU steal time was above 5.0% on the {{ $labels.host }} node for
5 minutes. |
Raise condition |
cpu_usage_steal > 5.0 |
Description |
Raises when a VM vCPU spends more than 5% of its time waiting for a
physical CPU during the last 5 minutes, which typically occurs under
high load when CPU resources are insufficient. Waiting for resources
slows down the processes in the VM.
Warning
For production environments, configure the alert after
deployment.
|
Tuning |
For example, to change the threshold to 2% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemCpuStealTimeWarning:
if: >-
cpu_usage_steal > 2.0
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
SystemCpuStealTimeCritical
Available starting from the 2019.2.6 maintenance update
Severity |
Critical |
Summary |
The CPU steal time was above 10.0% on the {{ $labels.host }} node
for 5 minutes. |
Raise condition |
cpu_usage_steal > 10.0 |
Description |
Raises when a VM vCPU spends more than 10% of its time waiting for a
physical CPU during the last 5 minutes, which typically occurs under
high load when CPU resources are insufficient. Waiting for resources
slows down the processes in the VM.
Warning
For production environments, configure the alert after
deployment.
|
Tuning |
For example, to change the threshold to 5% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
prometheus:
server:
alert:
SystemCpuStealTimeCritical:
if: >-
cpu_usage_steal > 5.0
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|