Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthWarning

Severity

Warning

Summary

Ceph cluster health is WARNING.

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.

CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.
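
As an illustration of how the two health alerts relate to the cluster status, the following minimal sketch reads the JSON form of the status report (for example, from ceph -s --format json) and returns the matching alert name. The mapping of HEALTH_WARN and HEALTH_ERR to the alert names is an assumption made for this example, and alert_for_status is a hypothetical helper, not part of Ceph or of the alerting rules.

```python
import json

# Assumed mapping from Ceph health status strings to the alerts described above.
HEALTH_TO_ALERT = {
    "HEALTH_WARN": "CephClusterHealthWarning",
    "HEALTH_ERR": "CephClusterHealthCritical",
}


def alert_for_status(status_json):
    """Return the alert name for a JSON status document, or None if healthy."""
    status = json.loads(status_json)
    # The overall health is expected under health.status in the JSON report.
    health = status.get("health", {}).get("status")
    return HEALTH_TO_ALERT.get(health)


# Hypothetical, heavily trimmed status document used only for demonstration.
sample = '{"health": {"status": "HEALTH_WARN"}}'
print(alert_for_status(sample))
```

A healthy cluster (HEALTH_OK) yields None in this sketch, meaning neither alert applies.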

CephClusterTargetDown

Severity

Critical

Summary

Ceph cluster Prometheus target is down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod on the {{ $labels.node }} node.

CephDaemonSlowOps

Available since 15.0.0 and 14.0.0

Severity

Warning

Summary

{{ $labels.ceph_daemon }} operations are slow.

Description

{{ $labels.ceph_daemon }} operations take too long to process on the Ceph cluster (complaint time exceeded).

CephMonClockSkew

Available since 15.0.0 and 14.0.0

Severity

Warning

Summary

Ceph Monitor clock skew detected.

Description

Ceph Monitor clock drift exceeds configured threshold on the Ceph cluster.

CephMonQuorumAtRisk

Severity

Major

Summary

Ceph cluster quorum at risk.

Description

The Ceph Monitor quorum on the Ceph cluster is low.

CephOSDDown

Removed in 17.0.0, 16.0.0, and 14.1.0

Severity

Critical

Summary

Ceph OSDs are down.

Description

{{ $value }} Ceph OSDs on the {{ $labels.rook_cluster }} cluster are down. For details, run ceph osd tree.

CephOSDFlapping

Available since 15.0.0 and 14.0.0

Severity

Warning

Summary

Ceph OSDs flap due to network issues.

Description

The Ceph OSD {{ $labels.ceph_daemon }} on the Ceph cluster changed between the up and down states {{ $value | humanize }} times within 5 minutes.
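
To make the notion of flapping concrete, the sketch below counts how many times an OSD switches state in a series of samples taken over a 5-minute window. The sample values and the count_state_changes helper are hypothetical. The actual alert expression is not shown in this section, so treat this only as an approximation of the idea.

```python
def count_state_changes(samples):
    """Count transitions between consecutive samples (1 = up, 0 = down)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Hypothetical up/down samples for one OSD collected over 5 minutes.
window = [1, 1, 0, 1, 0, 0, 1]
print(count_state_changes(window))  # 4
```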

CephOSDDiskNotResponding

Severity

Critical

Summary

Disk not responding.

Description

The {{ $labels.device }} disk device is not responding to {{ $labels.ceph_daemon }} on the {{ $labels.node }} node of the Ceph cluster.

CephOSDDiskUnavailable

Severity

Critical

Summary

Disk not accessible.

Description

The {{ $labels.device }} disk device is not accessible by {{ $labels.ceph_daemon }} on the {{ $labels.node }} node of the Ceph cluster.

CephOSDSlowClusterNetwork

Available since 15.0.0 and 14.0.0

Severity

Warning

Summary

Cluster network slows down Ceph OSD heartbeats.

Description

Ceph OSD heartbeats on the cluster network (back end) are running slow.

CephOSDSlowPublicNetwork

Available since 15.0.0 and 14.0.0

Severity

Warning

Summary

Public network slows down Ceph OSD heartbeats.

Description

Ceph OSD heartbeats on the public network (front end) are running slow.

CephClusterFullWarning

Severity

Warning

Summary

Ceph cluster is nearly full.

Description

The Ceph cluster utilization has crossed 85%. Expansion is required.

CephClusterFullCritical

Severity

Critical

Summary

Ceph cluster is full.

Description

The Ceph cluster utilization has crossed 95% and needs immediate expansion.
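
The two capacity thresholds can be sketched as a simple classification. This example assumes that utilization is the ratio of used to total raw capacity and that "crossed" means strictly greater than the threshold. The fullness_alert helper is hypothetical.

```python
WARNING_THRESHOLD = 0.85   # 85% utilization, per CephClusterFullWarning
CRITICAL_THRESHOLD = 0.95  # 95% utilization, per CephClusterFullCritical


def fullness_alert(used_bytes, total_bytes):
    """Return the capacity alert name for the given usage, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilization = used_bytes / total_bytes
    # Check the higher threshold first so that only one alert is returned.
    if utilization > CRITICAL_THRESHOLD:
        return "CephClusterFullCritical"
    if utilization > WARNING_THRESHOLD:
        return "CephClusterFullWarning"
    return None


print(fullness_alert(90, 100))  # CephClusterFullWarning
```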

CephOSDPgNumTooHighWarning

Severity

Warning

Summary

Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 Placement Groups. This may have a negative impact on the cluster performance. For details, run ceph pg dump.

CephOSDPgNumTooHighCritical

Severity

Critical

Summary

Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 Placement Groups. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
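
The Placement Group limits of 200 and 300 per OSD can be illustrated as follows. The per-OSD counts are made-up values and osds_over_pg_limit is a hypothetical helper. In practice, the counts would come from the cluster, for example from the output of ceph pg dump.

```python
def osds_over_pg_limit(pgs_per_osd, limit):
    """Return the sorted names of OSDs holding more than `limit` PGs."""
    return sorted(osd for osd, pgs in pgs_per_osd.items() if pgs > limit)


# Hypothetical PG counts per OSD.
pgs = {"osd.0": 180, "osd.1": 240, "osd.2": 310}
print(osds_over_pg_limit(pgs, 200))  # ['osd.1', 'osd.2']
print(osds_over_pg_limit(pgs, 300))  # ['osd.2']
```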

CephMonHighNumberOfLeaderChanges

Severity

Major

Summary

Ceph cluster has too many leader changes.

Description

The Ceph Monitor {{ $labels.ceph_daemon }} on the Ceph cluster has detected {{ $value }} leader changes per minute.

CephOSDNodeDown

Available since 17.0.0, 16.0.0, and 14.1.0 to replace CephNodeDown

Severity

Critical

Summary

Ceph node {{ $labels.node }} went down.

Description

The Ceph OSD node {{ $labels.node }} of the Ceph cluster went down and requires immediate verification.

CephNodeDown

Renamed to CephOSDNodeDown in 17.0.0, 16.0.0, and 14.1.0

Severity

Critical

Summary

Ceph node {{ $labels.node }} went down.

Description

The Ceph node {{ $labels.node }} of the {{ $labels.rook_cluster }} cluster went down and requires immediate verification.

CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of Ceph OSDs running.

Description

{{ $value }} different versions of Ceph OSD daemons are running on the cluster.

CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of Ceph Monitors running.

Description

{{ $value }} different versions of Ceph Monitors are running on the Ceph cluster.
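
Both version mismatch alerts report the number of distinct versions running at the same time. A minimal sketch of that count, assuming a hypothetical mapping of daemon names to version strings:

```python
def distinct_versions(daemon_versions):
    """Return the number of different versions among the given daemons."""
    return len(set(daemon_versions.values()))


def has_version_mismatch(daemon_versions):
    """A mismatch exists when more than one distinct version is running."""
    return distinct_versions(daemon_versions) > 1


# Hypothetical daemon-to-version mapping.
mons = {"mon.a": "15.2.8", "mon.b": "15.2.8", "mon.c": "14.2.19"}
print(distinct_versions(mons))     # 2
print(has_version_mismatch(mons))  # True
```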

CephPGInconsistent

Severity

Warning

Summary

Too many inconsistent Ceph PGs.

Description

The Ceph cluster detects inconsistencies in one or more replicas of an object in {{ $value }} Placement Groups on the {{ $labels.name }} pool.

CephPGUndersized

Severity

Warning

Summary

Too many undersized Ceph PGs.

Description

The Ceph cluster reports that {{ $value }} Placement Groups on the {{ $labels.name }} pool have fewer copies than the configured pool replication level.