Use S.M.A.R.T. metrics for creating alert rules on bare metal clusters
Available since 2.27.0 (Cluster releases 17.2.0 and 16.2.0)
The StackLight telegraf-ds-smart exporter uses the S.M.A.R.T. plugin to
obtain detailed disk information and export it as metrics on bare metal
clusters. S.M.A.R.T. is a disk monitoring system commonly used across vendors.
It provides performance data as attributes, although attribute names can
differ between vendors. Each attribute contains the following values:
- Raw value
Actual current value of the attribute. Units may differ across vendors.
- Current value
Health valuation where values can range from 1 to 253 (1 represents the worst case and 253 represents the best one). Depending on the manufacturer, a value of 100 or 200 is often selected as the normal value.
- Worst value
The worst value ever observed as a current one for a particular device.
- Threshold value
Lower threshold for the current value. If the current value drops below the lower threshold, it requires attention.
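The relationship between the four values can be illustrated with a minimal Python sketch. The SmartAttribute class, the needs_attention and headroom helpers, and the sample numbers are hypothetical and exist only for illustration; they are not part of StackLight or of the S.M.A.R.T. plugin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartAttribute:
    """Hypothetical container for the four values of one S.M.A.R.T. attribute."""
    name: str
    raw: int        # actual value, vendor-specific units
    current: int    # normalized health value, 1 (worst) to 253 (best)
    worst: int      # worst current value ever observed for the device
    threshold: int  # lower threshold for the current value


def needs_attention(attr: SmartAttribute) -> bool:
    """Return True when the current value has dropped below the threshold."""
    return attr.current < attr.threshold


def headroom(attr: SmartAttribute) -> int:
    """Return how far the current value is above the threshold."""
    return attr.current - attr.threshold


# Made-up sample values, not taken from a real disk.
healthy = SmartAttribute("Reallocated_Sector_Ct", raw=0, current=100, worst=100, threshold=36)
degraded = SmartAttribute("Reallocated_Sector_Ct", raw=2048, current=30, worst=30, threshold=36)

for attr in (healthy, degraded):
    print(attr.name, "needs attention:", needs_attention(attr), "headroom:", headroom(attr))
```

Because raw values use vendor-specific units, the sketch compares only the normalized current value against the threshold.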
The following table provides examples for alert rules based on S.M.A.R.T. metrics. These examples may not work for all clusters depending on vendor or disk types.
Caution
Before creating alert rules, manually test these expressions to verify whether they are valid for the cluster. You can also implement any other alerts based on S.M.A.R.T. metrics.
To create custom alert rules in StackLight, use the customAlerts parameter described in Alerts configuration.
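One possible way to test an expression manually before adding it to customAlerts is to run it as an instant query against the Prometheus HTTP API. The following Python sketch assumes that the Prometheus endpoint of the cluster is reachable at the placeholder URL below. PROMETHEUS_URL, the helper names, and the sample up expression are assumptions for illustration, not values from this document.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint (assumption); replace with the Prometheus URL of your cluster.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def parse_result(payload: dict) -> list:
    """Return the series of a query response or raise if the query failed."""
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    return payload["data"]["result"]


def run_query(base_url: str, expr: str, timeout: float = 10.0) -> list:
    """Run an expression; an invalid expression surfaces as an HTTP error."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as response:
        return parse_result(json.load(response))


# Example usage (requires network access to Prometheus), with a generic expression:
# for series in run_query(PROMETHEUS_URL, "up == 0"):
#     print(series["metric"], series["value"])
```

An empty result means that the expression is valid but currently matches no series, which is the expected state for an alert expression on a healthy cluster.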
Expression | Description |
| Alerts when a device |
| Indicates disk health failure. |
| Targets any S.M.A.R.T. attribute reaching its predefined threshold, indicating a potential risk or imminent failure of the disk. Utilizing this alert might eliminate the need for more specific attribute alerts by relying on the vendor’s established thresholds, streamlining the monitoring process. Implementing inhibition rules may be necessary to manage overlaps with other alerts effectively. |
| Is triggered when disk temperature exceeds 60°C, indicating potential overheating issues. |
| Identifies an increase in UDMA CRC errors, indicating data transmission issues between the disk and controller. |
| Is triggered during a noticeable increase in the rate of read errors on the disk. This is a strong indicator of issues with the disk surface or read/write heads that can affect data integrity and accessibility. |
| Is triggered when the disk experiences an increase in attempts to spin up to its operational speed, indicating potential issues with the disk motor, bearings, or power supply, which can lead to drive failure. |
| Is triggered during an increase in the number of disk sectors that cannot be corrected by the error correction algorithms of the drive, pointing towards serious disk surface or read/write head issues. |
| Is triggered on a rise in sectors that are marked as pending for remapping due to read errors. Persistent increases can indicate deteriorating disk health and impending failure. |
| Detects an upsurge in errors during the process of data transmission from the host to the disk and vice versa, highlighting potential issues in data integrity during transfer operations. |
| Is triggered during an increase in sectors that have been reallocated due to being deemed defective. A rising count is a critical sign of ongoing wear and tear, or damage to the disk surface. |
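Several descriptions in the table above rely on detecting an increase in an error counter over a time window. The following Python sketch shows that idea in a simplified form that is similar in spirit to the PromQL increase() function but performs no extrapolation. The counter_increase helper and the sample values are hypothetical and are not the expressions from the table.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Return the total growth of a counter across consecutive samples.

    A drop between two samples is treated as a counter reset, so the new
    value is counted as growth from zero.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def should_alert(samples: Sequence[float]) -> bool:
    """Alert when the counter grew at all within the sampled window."""
    return counter_increase(samples) > 0


# Made-up error counter samples collected within one window.
stable_window = [5, 5, 5, 5]
growing_window = [5, 5, 7, 7]

print("stable window alerts:", should_alert(stable_window))
print("growing window alerts:", should_alert(growing_window))
```

Alerting on growth rather than on the absolute counter value avoids permanent alerts for disks that accumulated a few errors in the past and have been stable since.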
The following table describes S.M.A.R.T. metrics provided by StackLight that you can use for creating alert rules depending on your cluster requirements:
Metric | Description |
| Reports current S.M.A.R.T. attribute values with labels for detailed context. |
| Indicates the fetching status of individual attributes. A non-zero code indicates monitoring issues. |
| Reports raw S.M.A.R.T. attribute values with labels for detailed context. |
| Reports S.M.A.R.T. attribute threshold values with labels for detailed context. |
| Reports the worst recorded values of S.M.A.R.T. attributes with labels for detailed context. |
| Counts timeouts when a drive fails to respond to a command, indicating responsiveness issues. |
| Reflects the overall device status post-checks, where values other than |
| Indicates overall device health, where values other than |
The following table describes metrics derived from various S.M.A.R.T.
attributes that are also part of the smart_attribute* metrics described above.
However, their value representation can differ, for example, through unified
units or counter information. Vendors may also name attributes differently,
whereas the following metrics are standardized across vendors. Depending on
the disk or vendor type, a cluster may lack some of the following metrics or
have extra ones.
Metric | Description |
| Monitors data transmission errors, where an increase suggests potential transfer issues. |
| Counts sectors awaiting remapping due to unrecoverable errors, with decreases over time indicating successful remapping. |
| Tracks errors occurring during disk data reads. |
| Counts defective sectors that have been remapped, with increases indicating drive degradation. |
| Measures the error frequency of the drive positioning mechanism, with high values indicating mechanical issues. |
| Tracks the drive attempts to spin up to operational speed, with increases indicating mechanical issues. |
| Reports the drive temperature in Celsius. |
| Counts errors in data communication between the drive and host. |
| Records total uncorrectable read/write errors. |
| Counts sectors that cannot be corrected, indicating potentially damaged sectors. |