Use S.M.A.R.T. metrics for creating alert rules on bare metal clusters

Available since 2.27.0 (Cluster releases 17.2.0 and 16.2.0)

The StackLight telegraf-ds-smart exporter uses the S.M.A.R.T. plugin to obtain detailed disk information and export it as metrics on bare metal clusters. S.M.A.R.T. is a system commonly used across vendors that provides performance data as attributes, although attribute names can differ between vendors. Each attribute has the following values:

  • Raw value

    Actual value of the attribute at the time of reading. Units may differ across vendors.

  • Current value

    Normalized health value that ranges from 1 to 253, where 1 represents the worst case and 253 represents the best one. Depending on the manufacturer, 100 or 200 is often used as the normal value.

  • Worst value

    The worst value ever observed as the current value for a particular device.

  • Threshold value

    Lower threshold for the current value. If the current value drops to or below the threshold, the attribute requires attention.

The following table provides examples of alert rules based on S.M.A.R.T. metrics. Depending on the vendor or disk type, these examples may not work for all clusters.

Caution

Before creating alert rules, manually test these expressions to verify whether they are valid for the cluster. You can also implement any other alerts based on S.M.A.R.T. metrics.

To create custom alert rules in StackLight, use the customAlerts parameter described in Alerts configuration.
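As a minimal sketch, one expression from this section might be wrapped into a custom alert as follows. The fragment uses the standard Prometheus alerting rule fields (alert, expr, for, labels, annotations). The alert name, the 5m hold time, the severity value, and the annotation texts are illustrative assumptions, and the exact place where customAlerts nests in the StackLight configuration is the one given in Alerts configuration.

```yaml
# Sketch only: alert name, "for" duration, severity, and annotation
# texts are hypothetical and must be adapted to the cluster.
customAlerts:
- alert: SmartDeviceHealthFailed        # hypothetical alert name
  expr: smart_device_health_ok == 0     # expression from this section
  for: 5m                               # assumed hold time before firing
  labels:
    severity: warning                   # assumed severity value
  annotations:
    summary: "S.M.A.R.T. health check failed"
    description: "A disk reports a failed S.M.A.R.T. health status."
```

Test the expression manually against the cluster before adding the rule, as recommended in the caution above.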

Expression

Description

expr: smart_device_exit_status > 0

Alerts when a device exit status signals potential issues.

expr: smart_device_health_ok == 0

Indicates disk health failure.

expr: smart_attribute_threshold >= smart_attribute

Targets any S.M.A.R.T. attribute that reaches its predefined threshold, which indicates a potential risk or imminent failure of the disk. Because this alert relies on the thresholds established by the vendor, it may remove the need for more specific attribute alerts and simplify monitoring. If you use it together with other alerts, you may need inhibition rules to manage the overlaps.

expr: smart_device_temp_c > 60

Is triggered when disk temperature exceeds 60°C, indicating potential overheating issues.

expr: increase(smart_device_udma_crc_errors[2m]) > 0

Identifies an increase in UDMA CRC errors, indicating data transmission issues between the disk and controller.

expr: increase(smart_device_read_error_rate[2m]) > 0

Is triggered on an increase in read errors on the disk. This is a strong indicator of issues with the disk surface or read/write heads that can affect data integrity and accessibility.

expr: increase(smart_device_spin_retry_count[2m]) > 0

Is triggered when the disk experiences an increase in attempts to spin up to its operational speed, indicating potential issues with the disk motor, bearings, or power supply, which can lead to drive failure.

expr: increase(smart_device_uncorrectable_sector_count[2m]) > 0

Is triggered on an increase in the number of disk sectors that the error correction algorithms of the drive cannot correct, which points to serious disk surface or read/write head issues.

expr: increase(smart_device_pending_sector_count[2m]) > 0

Is triggered on a rise in sectors that are marked as pending for remapping due to read errors. Persistent increases can indicate deteriorating disk health and impending failure.

expr: increase(smart_device_end_to_end_error[2m]) > 0

Detects an increase in errors during data transmission between the host and the disk, highlighting potential data integrity issues during transfer operations.

expr: increase(smart_device_reallocated_sectors_count[2m]) > 0

Is triggered on an increase in sectors that have been reallocated after being deemed defective. A rising count is a critical sign of ongoing wear and tear, or damage to the disk surface.
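The inhibition rules mentioned for the threshold-based alert can be sketched in the standard Alertmanager inhibit_rules format. This is a hedged illustration, not the StackLight method: all alert names are hypothetical, the device label is assumed to be present on both alerts, and where such a fragment is placed depends on how Alertmanager is configured in the deployment. Here, a firing threshold alert mutes the more specific sector alerts for the same disk.

```yaml
# Sketch only: generic Alertmanager configuration fragment.
# All alert names are hypothetical; "device" is an assumed label.
inhibit_rules:
- source_matchers:
  # The generic alert built on
  # smart_attribute_threshold >= smart_attribute
  - 'alertname="SmartAttributeThresholdReached"'
  target_matchers:
  # More specific alerts to mute while the source alert fires
  - 'alertname=~"SmartReallocatedSectorsIncrease|SmartPendingSectorsIncrease"'
  # Inhibit only when both alerts refer to the same disk
  equal:
  - device
```

Reversing the source and target matchers would instead let the specific alerts take precedence over the generic one. Which direction fits better depends on which alert carries more useful context for the operator.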

The following table describes S.M.A.R.T. metrics provided by StackLight that you can use for creating alert rules depending on your cluster requirements:

Metric

Description

smart_attribute

Reports current S.M.A.R.T. attribute values with labels for detailed context.

smart_attribute_exit_status

Indicates the fetching status of individual attributes. A non-zero code indicates monitoring issues.

smart_attribute_raw_value

Reports raw S.M.A.R.T. attribute values with labels for detailed context.

smart_attribute_threshold

Reports S.M.A.R.T. attribute threshold values with labels for detailed context.

smart_attribute_worst

Reports the worst recorded values of S.M.A.R.T. attributes with labels for detailed context.

smart_device_command_timeout

Counts timeouts when a drive fails to respond to a command, indicating responsiveness issues.

smart_device_exit_status

Reflects the overall device status post-checks, where values other than 0 indicate issues.

smart_device_health_ok

Indicates overall device health, where values other than 1 indicate issues. Relates to the --health option of the smartctl tool.
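Because a non-zero smart_attribute_exit_status points to a monitoring problem rather than a disk fault, it may be useful to alert on it separately so that missing attribute data does not go unnoticed. The rule below is a sketch in the standard Prometheus alerting rule format. The alert name, hold time, severity, and annotation text are assumptions, and the expression is derived from the metric description above rather than taken from a tested configuration.

```yaml
# Sketch only: names, duration, and severity are hypothetical.
- alert: SmartAttributeCollectionFailed   # hypothetical alert name
  expr: smart_attribute_exit_status != 0  # non-zero means fetching issues
  for: 10m                                # assumed; tolerates short glitches
  labels:
    severity: minor                       # assumed severity value
  annotations:
    summary: "S.M.A.R.T. attribute collection failed"
    description: "Fetching S.M.A.R.T. attributes returned a non-zero exit status."
```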

The following table describes metrics derived from individual S.M.A.R.T. attributes that are also part of the smart_attribute* metrics above. Their value representation can differ from the attribute values, for example, through unified units or counter semantics. Although vendors may name attributes differently, the following metric names are standardized across vendors. Depending on the disk or vendor type, a cluster may lack some of these metrics or have extra ones.

Metric

Description

smart_device_end_to_end_error

Monitors data transmission errors, where an increase suggests potential transfer issues.

smart_device_pending_sector_count

Counts sectors awaiting remapping due to unrecoverable errors, with decreases over time indicating successful remapping.

smart_device_read_error_rate

Tracks errors occurring during disk data reads.

smart_device_reallocated_sectors_count

Counts defective sectors that have been remapped, with increases indicating drive degradation.

smart_device_seek_error_rate

Measures the error frequency of the drive positioning mechanism, with high values indicating mechanical issues.

smart_device_spin_retry_count

Tracks retried attempts of the drive to spin up to operational speed, with increases indicating mechanical issues.

smart_device_temp_c

Reports the drive temperature in Celsius.

smart_device_udma_crc_errors

Counts errors in data communication between the drive and host.

smart_device_uncorrectable_errors

Records total uncorrectable read/write errors.

smart_device_uncorrectable_sector_count

Counts sectors that cannot be corrected, indicating potentially damaged sectors.
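Since a cluster may lack some of these metrics, an alert built on a missing metric never fires and fails silently. One way to notice this is the PromQL absent() function, which returns a result only when no series with the given name exists. The rule below is a sketch in the standard Prometheus alerting rule format. The alert name, hold time, severity, and the choice of metric are illustrative assumptions.

```yaml
# Sketch only: names, duration, and severity are hypothetical.
- alert: SmartReallocatedSectorsMetricMissing  # hypothetical alert name
  # absent() yields a value only if no series of this metric exists
  expr: absent(smart_device_reallocated_sectors_count)
  for: 15m                                     # assumed hold time
  labels:
    severity: informational                    # assumed severity value
  annotations:
    summary: "Expected S.M.A.R.T. metric is not exported"
    description: "No smart_device_reallocated_sectors_count series found."
```

Note that absent() fires only when the metric is missing for all disks. It does not detect a single disk that fails to report the metric while others do.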