Troubleshoot Ubuntu kernel alerts

This section describes the investigation and troubleshooting steps for the Ubuntu kernel alerts.


KernelIOErrorsDetected

Available since 2.27.0 (Cluster releases 17.2.0 and 16.2.0)

Root cause

Kernel logs generated IO error logs, potentially indicating disk issues. IO errors may occur due to various reasons and are often unpredictable.

Investigation

Inspect kernel logs on affected nodes for IO errors to pinpoint the issue, identify the affected disk, if any, and assess its condition. Most major Linux distributions store kernel logs in /var/log/dmesg and occasionally in /var/log/kern.log.

If the issue is not related to a faulty disk, further inspect errors in logs to identify the root cause.

Mitigation

Mitigation steps depend on the identified issue. If the issue is caused by a faulty disk, replace the affected disk. Additionally, consider the following measures to prevent such issues in the future:

  • Implement proactive monitoring of disk health to detect early signs of failure and initiate replacements preemptively.

  • Utilize tools such as smartctl or nvme` for routine collection of disk metrics, enabling prediction of failures and early identification of underperforming disks to prevent major disruptions.