SMART disks

This section describes the alerts for SMART disks.

Warning

SMART disk monitoring is available starting from the MCP 2019.2.3 update. For existing MCP deployments, enable SMART disk monitoring manually as described in Enable SMART disk monitoring.

Warning

SMART disks alerts have been removed starting from the MCP 2019.2.9 update.

SystemSMARTDiskUDMACrcErrorsTooHigh
SystemSMARTDiskHealthStatus
SystemSMARTDiskReadErrorRate
SystemSMARTDiskSeekErrorRate
SystemSMARTDiskTemperatureHigh
SystemSMARTDiskReallocatedSectorsCount
SystemSMARTDiskCurrentPendingSectors
SystemSMARTDiskReportedUncorrectableErrors
SystemSMARTDiskOfflineUncorrectableSectors
SystemSMARTDiskEndToEndError

SystemSMARTDiskUDMACrcErrorsTooHigh

^{Available starting from the 2019.2.3 maintenance update}

Severity	Warning
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node is reporting UDMA CRC errors for 5 minutes.
Raise condition	`increase(smart_device_udma_crc_errors[1m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) detects the `SMART UDMA CRC` error messages on a host every minute for 5 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskUDMACrcErrorsTooHigh:
                if: >-
                  increase(smart_device_udma_crc_errors[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
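As a starting point for the troubleshooting step above, the following sketch extracts the raw value of a single attribute from `smartctl -A` output. The helper name `smart_raw_value` and the device path `/dev/sda` are illustrative assumptions. `UDMA_CRC_Error_Count` is the name that smartmontools commonly prints for this counter, but it may differ between drive models. The sketch assumes the standard ATA attribute table, where the raw value is the tenth column.

```shell
# Print the raw value of the named SMART attribute.
# Reads `smartctl -A` output on stdin; assumes the ATA attribute table
# layout, where column 2 is the attribute name and column 10 the raw value.
smart_raw_value() {
  awk -v name="$1" '$2 == name { print $10 }'
}

# Example (run as root on the affected node; /dev/sda is a placeholder):
#   smartctl -A /dev/sda | smart_raw_value UDMA_CRC_Error_Count
```

A raw value that keeps growing is often associated with a faulty cable or connector rather than with the disk media. Treat this as a hint, not a diagnosis.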

SystemSMARTDiskHealthStatus

^{Available starting from the 2019.2.3 maintenance update}

Severity	Warning
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node is reporting bad health status for 1 minute.
Raise condition	`smart_device_health_ok == 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) detects bad health status of a SMART disk on a host for 1 minute. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	Not required
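To confirm the health status directly on the node, a minimal sketch such as the following can interpret the output of `smartctl -H`. The helper name `smart_health_ok` and the device path are illustrative assumptions. The matched strings are the ones smartmontools typically prints for ATA and SCSI disks and may vary between versions.

```shell
# Return 0 if `smartctl -H` output on stdin reports a healthy disk.
# ATA disks print "... test result: PASSED"; SCSI/SAS disks print
# "SMART Health Status: OK".
smart_health_ok() {
  grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example (run as root; /dev/sda is a placeholder):
#   smartctl -H /dev/sda | smart_health_ok || echo "disk reports bad health"
```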

SystemSMARTDiskReadErrorRate

^{Available starting from the 2019.2.3 maintenance update}

Severity	Warning
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node is reporting an increased read error rate for 5 minutes.
Raise condition	`increase(smart_device_read_error_rate[1m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) detects the `ReadErrorRate` messages on a host every minute for the last 5 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskReadErrorRate:
                if: >-
                  increase(smart_device_read_error_rate[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
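On some drive models, the raw value of the read error rate is encoded in a vendor-specific way, so a non-zero number alone may not indicate a problem. While inspecting the disk, it can help to check whether any attribute's normalized value has reached its threshold. The following sketch does that under the assumption of the standard ATA attribute table. The helper name `smart_failing_attributes` is hypothetical.

```shell
# List attributes whose normalized VALUE has dropped to or below THRESH.
# Reads `smartctl -A` output on stdin; assumes the ATA attribute table
# (column 4 = VALUE, column 6 = THRESH). Rows with THRESH 0 are skipped
# because such attributes have no failure threshold.
smart_failing_attributes() {
  awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Example (run as root; /dev/sda is a placeholder):
#   smartctl -A /dev/sda | smart_failing_attributes
```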

SystemSMARTDiskSeekErrorRate

^{Available starting from the 2019.2.3 maintenance update}

Severity	Warning
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node is reporting an increased seek error rate for 5 minutes.
Raise condition	`increase(smart_device_seek_error_rate[1m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) detects the `SeekErrorRate` messages on a host every minute for the last 5 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskSeekErrorRate:
                if: >-
                  increase(smart_device_seek_error_rate[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
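Before overriding the `if` parameter, it may be useful to check which series the tuned expression would match right now. The sketch below sends the expression to the instant query endpoint of the Prometheus HTTP API. The `PROMETHEUS_URL` default and the helper name `prom_query` are assumptions for illustration; substitute the Prometheus address of your deployment.

```shell
# Hypothetical address; replace with the Prometheus endpoint of your deployment.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Evaluate a PromQL expression through the Prometheus HTTP API
# (instant query endpoint /api/v1/query) and print the JSON response.
prom_query() {
  curl -sG --data-urlencode "query=$1" "${PROMETHEUS_URL}/api/v1/query"
}

# Example: list series that would match the tuned threshold right now.
#   prom_query 'increase(smart_device_seek_error_rate[5m]) > 5'
```

An empty `result` list in the response means that no disk currently satisfies the expression.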

SystemSMARTDiskTemperatureHigh

^{Available starting from the 2019.2.3 maintenance update}

Severity	Warning
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has a temperature of `{{ $value }}C` for 5 minutes.
Raise condition	`smart_device_temp_c >= 60`
Description	Raises when the SMART Telegraf input plugin detects that the SMART disk temperature on a host has been at or above the default threshold of 60°C for the last 5 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 40°C:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskTemperatureHigh:
                if: >-
                  smart_device_temp_c >= 40
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
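To compare the temperature reported on the node with the alert threshold, a sketch along these lines can be used. It assumes that the drive exposes the `Temperature_Celsius` attribute and that the first token of its raw value is the current temperature; some drives report temperature under a different attribute name. Both helper names are hypothetical.

```shell
# Print the disk temperature in degrees Celsius from `smartctl -A` output
# on stdin. Assumes the drive exposes the Temperature_Celsius attribute
# and that the first token of the raw value is the current temperature.
smart_temperature_c() {
  awk '$2 == "Temperature_Celsius" { print $10 + 0 }'
}

# Return 0 if the temperature read from stdin is at or above the limit
# (default 60, matching the default raise condition).
temperature_at_or_above() {
  limit="${1:-60}"
  temp=$(smart_temperature_c)
  [ -n "$temp" ] && [ "$temp" -ge "$limit" ]
}

# Example (run as root; /dev/sda is a placeholder):
#   smartctl -A /dev/sda | temperature_at_or_above 60 && echo "too hot"
```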

SystemSMARTDiskReallocatedSectorsCount

^{Available starting from the 2019.2.3 maintenance update}

Severity	Major ^{Minor in 2019.2.4}
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has reallocated `{{ $value }}` sectors.
Raise condition	`increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[10m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) reports an increase in the raw value of the `Reallocated_Sector_Ct` attribute within the last 10 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskReallocatedSectorsCount:
                if: >-
                  increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
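Several alerts in this section watch raw SMART counters by attribute name. As a troubleshooting aid, the sketch below prints those counters when they are non-zero. It assumes the ATA attribute table printed by `smartctl -A` and uses the attribute names from the raise conditions in this section. The helper name `smart_nonzero_counters` is hypothetical.

```shell
# Print "name raw_value" for the sector- and error-related attributes
# used by the alerts in this section when their raw value is non-zero.
# Reads `smartctl -A` output on stdin (ATA attribute table assumed).
smart_nonzero_counters() {
  awk '
    BEGIN {
      want = "^(Reallocated_Sector_Ct|Current_Pending_Sector|Reported_Uncorrect|Offline_Uncorrectable|End-to-End_Error)$"
    }
    $2 ~ want && $10 + 0 > 0 { print $2, $10 }
  '
}

# Example (run as root; /dev/sda is a placeholder):
#   smartctl -A /dev/sda | smart_nonzero_counters
```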

SystemSMARTDiskCurrentPendingSectors

^{Available starting from the 2019.2.3 maintenance update}

Severity	Major ^{Minor in 2019.2.4}
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has `{{ $value }}` `current pending` sectors.
Raise condition	`increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[10m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) reports an increase in the raw value of the `Current_Pending_Sector` attribute within the last 10 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskCurrentPendingSectors:
                if: >-
                  increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskReportedUncorrectableErrors

^{Available starting from the 2019.2.3 maintenance update}

Severity	Major ^{Minor in 2019.2.4}
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has `{{ $value }}` `reported uncorrectable` errors.
Raise condition	`increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[10m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) reports an increase in the raw value of the `Reported_Uncorrect` attribute within the last 10 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskReportedUncorrectableErrors:
                if: >-
                  increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskOfflineUncorrectableSectors

^{Available starting from the 2019.2.5 maintenance update}

Severity	Major
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has `{{ $value }}` `offline uncorrectable` sectors.
Raise condition	`smart_attribute_raw_value{name="Offline_Uncorrectable"} > 0`
Description	Raises when the SMART input plugin detects that the `Offline_Uncorrectable` parameter is not `0`. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	Not required
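One possible way to gather more evidence for this alert is to run an extended self-test and review the self-test log. The sketch below wraps the two `smartctl` invocations. The function names and the device path are illustrative assumptions, and the test duration depends on the disk.

```shell
# Start an extended (long) self-test on the given device.
start_long_selftest() {
  smartctl -t long "$1"
}

# Show the self-test log, which lists completed tests and, for failed
# tests, the LBA of the first error.
show_selftest_log() {
  smartctl -l selftest "$1"
}

# Example (run as root; /dev/sda is a placeholder):
#   start_long_selftest /dev/sda
#   show_selftest_log /dev/sda   # after the test has finished
```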

SystemSMARTDiskEndToEndError

^{Available starting from the 2019.2.3 maintenance update}

Severity	Major ^{Minor in 2019.2.4}
Summary	The `{{ $labels.device }}` disk on the `{{ $labels.host }}` node has `{{ $value }}` `end-to-end` errors.
Raise condition	`increase(smart_attribute_raw_value{name="End-to-End_Error"}[10m]) > 0`
Description	Raises when the SMART Telegraf input plugin (using `smartmontools`) reports an increase in the raw value of the `End-to-End_Error` attribute within the last 10 minutes. The `host` and `device` labels in the raised alert contain the name of the affected node and the affected device.
Troubleshooting	Inspect the disk SMART data using the `smartctl` command.
Tuning	For example, to change the threshold to 5 errors during 5 minutes:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   a. Create a file for alert customizations:
      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   b. Define the new file in `cluster/<cluster_name>/stacklight/server.yml`:
      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
2. In the defined alert customizations file, modify the alert threshold by overriding the `if` parameter:
      parameters:
        prometheus:
          server:
            alert:
              SystemSMARTDiskEndToEndError:
                if: >-
                  increase(smart_attribute_raw_value{name="End-to-End_Error"}[5m]) > 5
3. From the Salt Master node, apply the changes:
      salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
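The `device` label identifies one disk, but it can be worth inspecting the other disks on the same node as well. The following sketch extracts device paths from `smartctl --scan` output so that they can be fed into further `smartctl` calls. The helper name `scan_device_paths` is hypothetical, and the sample line in the comment only illustrates the expected shape of the scan output.

```shell
# Print device paths from `smartctl --scan` output on stdin.
# Each line is expected to start with the device path, for example
# "/dev/sda -d scsi # /dev/sda, SCSI device".
scan_device_paths() {
  awk '$1 ~ /^\/dev\// { print $1 }'
}

# Example (run as root): print the attribute table of every detected disk.
#   smartctl --scan | scan_device_paths | while read -r dev; do
#     echo "== $dev"
#     smartctl -A "$dev"
#   done
```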

updated: 2025-01-10 08:56
