This section describes the alerts for SMART disks.
Warning
SMART disk monitoring is available starting from the MCP 2019.2.3 update. For existing MCP deployments, enable SMART disk monitoring manually as described in Enable SMART disk monitoring.
Warning
SMART disk alerts have been removed starting from the MCP 2019.2.9 update.
Available starting from the 2019.2.3 maintenance update
Severity | Warning |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting UDMA CRC errors for 5 minutes. |
Raise condition | increase(smart_device_udma_crc_errors[1m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the SMART UDMA CRC error messages on a host every minute for 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
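
The tuning example above (5 errors during 5 minutes) could translate into a raise condition like the following. This is a sketch rather than the shipped configuration: it assumes the range window is widened from 1m to 5m and that the alert should fire only when more than 5 errors accumulate.

```promql
# Assumed tuned condition: more than 5 new UDMA CRC errors within 5 minutes.
# Use ">= 5" instead if exactly 5 errors should already raise the alert.
increase(smart_device_udma_crc_errors[5m]) > 5
```

Because increase() extrapolates over the range window, the result may be fractional, so the choice between > and >= rarely matters at the boundary.
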
Available starting from the 2019.2.3 maintenance update
Severity | Warning |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting bad health status for 1 minute. |
Raise condition | smart_device_health_ok == 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects bad health status of a SMART disk on a host for 1 minute. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | Not required |
Available starting from the 2019.2.3 maintenance update
Severity | Warning |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes. |
Raise condition | increase(smart_device_read_error_rate[1m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReadErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
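
For the read error rate alert, the tuned threshold described above might look as follows. The 5m window and the strict comparison are assumptions made for illustration, not values confirmed by the source.

```promql
# Assumed tuned condition: read error rate counter grew by more than 5 in 5 minutes.
increase(smart_device_read_error_rate[5m]) > 5
```
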
Available starting from the 2019.2.3 maintenance update
Severity | Warning |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes. |
Raise condition | increase(smart_device_seek_error_rate[1m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the SeekErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | To change the threshold to 5 errors during 5 minutes: |
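
A possible form of the tuned seek error rate condition is sketched below. It assumes the same pattern as the default expression, with the window extended to 5m and the threshold raised to 5.

```promql
# Assumed tuned condition: seek error rate counter grew by more than 5 in 5 minutes.
increase(smart_device_seek_error_rate[5m]) > 5
```
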
Available starting from the 2019.2.3 maintenance update
Severity | Warning |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes. |
Raise condition | smart_device_temp_c >= 60 |
Description | Raises when the SMART Telegraf input plugin detects that the SMART disk temperature on a host is at or above the default threshold of 60°C for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to |
Available starting from the 2019.2.3 maintenance update
Severity | Major (Minor in 2019.2.4) |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors. |
Raise condition | increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[10m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReallocatedSectorsCount messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
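
The tuning example above could be written as the expression below. This sketch assumes the default 10m window is narrowed to 5m and the threshold is raised from 0 to 5; the label selector is copied from the default raise condition.

```promql
# Assumed tuned condition: more than 5 newly reallocated sectors within 5 minutes.
increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[5m]) > 5
```
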
Available starting from the 2019.2.3 maintenance update
Severity | Major (Minor in 2019.2.4) |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors. |
Raise condition | increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[10m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the CurrentPendingSectors messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
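
For current pending sectors, a tuned condition matching the example above might be expressed as shown here. The 5m window and the > 5 comparison are illustrative assumptions.

```promql
# Assumed tuned condition: pending sector count grew by more than 5 in 5 minutes.
increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[5m]) > 5
```
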
Available starting from the 2019.2.3 maintenance update
Severity | Major (Minor in 2019.2.4) |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors. |
Raise condition | increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[10m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReportedUncorrectableErrors messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
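
A hedged sketch of the tuned condition for reported uncorrectable errors follows. It reuses the default selector and assumes only the window and threshold change.

```promql
# Assumed tuned condition: more than 5 new reported uncorrectable errors in 5 minutes.
increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[5m]) > 5
```
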
Available starting from the 2019.2.5 maintenance update
Severity | Major |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors. |
Raise condition | smart_attribute_raw_value{name="Offline_Uncorrectable"} > 0 |
Description | Raises when the SMART input plugin detects that the Offline_Uncorrectable parameter is not 0. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | Not required |
Available starting from the 2019.2.3 maintenance update
Severity | Major (Minor in 2019.2.4) |
---|---|
Summary | The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors. |
Raise condition | increase(smart_attribute_raw_value{name="End-to-End_Error"}[10m]) > 0 |
Description | Raises when the SMART Telegraf input plugin (using smartmontools) detects the EndToEndError messages every 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device. |
Troubleshooting | Inspect the disk SMART data using the smartctl command. |
Tuning | For example, to change the threshold to 5 errors during 5 minutes: |
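
The end-to-end error threshold from the example above could be tuned along these lines. As with the other sketches, the 5m window and the strict > 5 comparison are assumptions rather than documented values.

```promql
# Assumed tuned condition: more than 5 new end-to-end errors within 5 minutes.
increase(smart_attribute_raw_value{name="End-to-End_Error"}[5m]) > 5
```
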