SMART disks

This section describes the alerts for SMART disks.

Warning

SMART disk monitoring is available starting from the MCP 2019.2.3 update. For existing MCP deployments, enable SMART disk monitoring manually as described in Enable SMART disk monitoring.

Warning

SMART disk alerts have been removed starting from the MCP 2019.2.9 update.


SystemSMARTDiskUDMACrcErrorsTooHigh

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting UDMA CRC errors for 5 minutes.

Raise condition

increase(smart_device_udma_crc_errors[1m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects new UDMA CRC errors on a host every minute for 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.
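
As a minimal sketch of such an inspection, the helper below prints the raw value of a single attribute from the smartctl -A table, for example UDMA_CRC_Error_Count. The function name smart_attr_raw and the /dev/sda device are illustrative assumptions, and the column layout applies to ATA disks; SAS and NVMe devices report their data differently.

```shell
# Hypothetical helper: print the raw value of one SMART attribute.
# Usage: smart_attr_raw <device> <attribute_name>
smart_attr_raw() {
    # In the `smartctl -A` attribute table of an ATA disk, column 2 is the
    # attribute name and column 10 is the start of the raw value.
    smartctl -A "$1" | awk -v name="$2" '$2 == name { print $10 }'
}

# Example (run as root on the affected node; /dev/sda is an assumption):
#   smart_attr_raw /dev/sda UDMA_CRC_Error_Count
```

A raw value that keeps growing between runs often points to a cabling or controller problem rather than to the disk media.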

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskUDMACrcErrorsTooHigh:
              if: >-
                increase(smart_device_udma_crc_errors[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
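
In addition to the web UI check, a tuned expression can be evaluated directly through the Prometheus HTTP API instant query endpoint (/api/v1/query). The sketch below assumes that the Prometheus server is reachable from the node where the command runs; the function name promql_instant and the server address are placeholders, not part of MCP.

```shell
# Hypothetical helper: evaluate a PromQL expression and print the JSON reply.
# Usage: promql_instant <prometheus_url> <expression>
promql_instant() {
    # -G sends the URL-encoded data as a GET query string.
    curl -sS -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# Example (replace the address with the one used in your deployment):
#   promql_instant http://<prometheus_host>:<port> \
#       'increase(smart_device_udma_crc_errors[5m]) > 5'
```

An empty result list in the reply means that no disk currently satisfies the condition.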

SystemSMARTDiskHealthStatus

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting bad health status for 1 minute.

Raise condition

smart_device_health_ok == 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects bad health status of a SMART disk on a host for 1 minute. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.
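
For this alert, the overall health verdict is the most relevant part of the smartctl output. The following sketch wraps smartctl -H in a check that succeeds only for a healthy disk; the function name smart_health_ok and the /dev/sda device are assumptions made for illustration.

```shell
# Hypothetical helper: succeed (exit 0) when smartctl reports a healthy disk.
# Usage: smart_health_ok <device>
smart_health_ok() {
    # ATA disks print "... test result: PASSED"; SCSI/SAS disks print
    # "SMART Health Status: OK".
    smartctl -H "$1" | grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example (run as root on the affected node; /dev/sda is an assumption):
#   smart_health_ok /dev/sda || echo "/dev/sda reports bad SMART health"
```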

Tuning

Not required

SystemSMARTDiskReadErrorRate

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.

Raise condition

increase(smart_device_read_error_rate[1m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects the ReadErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReadErrorRate:
              if: >-
                increase(smart_device_read_error_rate[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskSeekErrorRate

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.

Raise condition

increase(smart_device_seek_error_rate[1m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects the SeekErrorRate messages on a host every minute for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskSeekErrorRate:
              if: >-
                increase(smart_device_seek_error_rate[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskTemperatureHigh

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.

Raise condition

smart_device_temp_c >= 60

Description

Raises when the SMART Telegraf input plugin detects that the SMART disk temperature on a host is at or above the default threshold of 60°C for the last 5 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.
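
To compare the current reading with the alert threshold on the node itself, a check along the following lines may help. It is a sketch under assumptions: Temperature_Celsius is the usual ATA attribute name but some disks expose the temperature under another name, and the function name smart_temp_at_least and the /dev/sda device are illustrative.

```shell
# Hypothetical helper: succeed when the disk temperature is at or above a limit.
# Usage: smart_temp_at_least <device> <limit_celsius>
smart_temp_at_least() {
    # Column 10 of the matching `smartctl -A` row starts with the
    # temperature in degrees Celsius.
    temp=$(smartctl -A "$1" | awk '$2 == "Temperature_Celsius" { print $10; exit }')
    # Fail when the attribute is missing or the reading is below the limit.
    [ -n "$temp" ] && [ "$temp" -ge "$2" ]
}

# Example (run as root on the affected node; /dev/sda is an assumption):
#   smart_temp_at_least /dev/sda 60 && echo "/dev/sda is at or above 60C"
```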

Tuning

For example, to change the threshold to 40°C:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskTemperatureHigh:
              if: >-
                smart_device_temp_c >= 40
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskReallocatedSectorsCount

Available starting from the 2019.2.3 maintenance update

Severity

Major (Minor in 2019.2.4)

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.

Raise condition

increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[10m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects an increase in the raw value of the Reallocated_Sector_Ct attribute over the last 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReallocatedSectorsCount:
              if: >-
                increase(smart_attribute_raw_value{name="Reallocated_Sector_Ct"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskCurrentPendingSectors

Available starting from the 2019.2.3 maintenance update

Severity

Major (Minor in 2019.2.4)

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.

Raise condition

increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[10m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects an increase in the raw value of the Current_Pending_Sector attribute over the last 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskCurrentPendingSectors:
              if: >-
                increase(smart_attribute_raw_value{name="Current_Pending_Sector"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskReportedUncorrectableErrors

Available starting from the 2019.2.3 maintenance update

Severity

Major (Minor in 2019.2.4)

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.

Raise condition

increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[10m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects an increase in the raw value of the Reported_Uncorrect attribute over the last 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskReportedUncorrectableErrors:
              if: >-
                increase(smart_attribute_raw_value{name="Reported_Uncorrect"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

SystemSMARTDiskOfflineUncorrectableSectors

Available starting from the 2019.2.5 maintenance update

Severity

Major

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.

Raise condition

smart_attribute_raw_value{name="Offline_Uncorrectable"} > 0

Description

Raises when the SMART Telegraf input plugin detects that the raw value of the Offline_Uncorrectable attribute is greater than 0. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

Not required

SystemSMARTDiskEndToEndError

Available starting from the 2019.2.3 maintenance update

Severity

Major (Minor in 2019.2.4)

Summary

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.

Raise condition

increase(smart_attribute_raw_value{name="End-to-End_Error"}[10m]) > 0

Description

Raises when the SMART Telegraf input plugin (using smartmontools) detects an increase in the raw value of the End-to-End_Error attribute over the last 10 minutes. The host and device labels in the raised alert contain the name of the affected node and the affected device.

Troubleshooting

Inspect the disk SMART data using the smartctl command.

Tuning

For example, to change the threshold to 5 errors during 5 minutes:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            SystemSMARTDiskEndToEndError:
              if: >-
                increase(smart_attribute_raw_value{name="End-to-End_Error"}[5m]) > 5
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.