Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity

Minor

Summary

Ceph cluster health is WARNING.

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.


CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.
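
As a rough illustration of how the two health alerts relate to the cluster state, the following Python sketch maps a reported health status to an alert name. It assumes that the JSON output of ceph -s --format json exposes the status under health.status, and that the HEALTH_WARN and HEALTH_ERR statuses correspond to the WARNING and CRITICAL states described above. The health_alert function and the mapping are hypothetical and are not taken from the alert definitions.

```python
import json

# Assumed mapping from Ceph health statuses to the alerts in this section.
ALERT_BY_STATUS = {
    "HEALTH_WARN": "CephClusterHealthMinor",
    "HEALTH_ERR": "CephClusterHealthCritical",
}


def health_alert(status_json):
    """Return the matching alert name, or None if no health alert applies."""
    status = json.loads(status_json)["health"]["status"]
    return ALERT_BY_STATUS.get(status)


# Hand-written sample input, not captured from a real cluster.
sample = '{"health": {"status": "HEALTH_WARN"}}'
print(health_alert(sample))
```

In practice, the string passed to health_alert would come from the command output rather than from a hand-written sample.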


CephMonQuorumAtRisk

Severity

Major

Summary

Storage quorum is at risk.

Description

The number of Ceph Monitors in the storage cluster quorum is low, so the quorum is at risk of being lost.


CephOSDDownMinor

Severity

Minor

Summary

Ceph OSDs are down.

Description

{{ $value }} of Ceph OSDs in the Ceph cluster are down. For details, run ceph osd tree.


CephOSDDiskNotResponding

Severity

Critical

Summary

Disk is not responding.

Description

The {{ $labels.device }} disk device is not responding on the {{ $labels.host }} host.


CephOSDDiskUnavailable

Severity

Critical

Summary

Disk is not accessible.

Description

The {{ $labels.device }} disk device is not accessible on the {{ $labels.host }} host.


CephClusterNearFull

Severity

Warning

Summary

Storage cluster is nearly full. Expansion is required.

Description

The storage cluster utilization has crossed 85%.


CephClusterCriticallyFull

Severity

Critical

Summary

Storage cluster is critically full and needs immediate expansion.

Description

The storage cluster utilization has crossed 95%.
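
The two utilization thresholds can be sketched as a simple classification. The Python example below is a minimal illustration that treats "crossed" as strictly greater than the threshold. That interpretation, the fullness_alert name, and the byte-count inputs are assumptions, not part of the alert definitions.

```python
NEAR_FULL_RATIO = 0.85      # threshold stated for CephClusterNearFull
CRITICAL_FULL_RATIO = 0.95  # threshold stated for CephClusterCriticallyFull


def fullness_alert(used_bytes, total_bytes):
    """Return the most severe fullness alert for the given usage, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    # Assumption: "crossed" means strictly above the threshold.
    if ratio > CRITICAL_FULL_RATIO:
        return "CephClusterCriticallyFull"
    if ratio > NEAR_FULL_RATIO:
        return "CephClusterNearFull"
    return None


print(fullness_alert(90, 100))
```

The sketch reports only the most severe match, whereas a monitoring system may keep both alerts active at the same time.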


CephOSDPgNumTooHighWarning

Severity

Warning

Summary

Some Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephOSDPgNumTooHighCritical

Severity

Critical

Summary

Some Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
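
To show how the 200 and 300 PG limits separate the warning and critical cases, the following Python sketch classifies a per-OSD PG count. The pg_count_alerts function and its input dictionary are hypothetical. In practice, the counts would have to be derived from the ceph pg dump output or from a monitoring metric.

```python
WARNING_PGS = 200   # limit stated for CephOSDPgNumTooHighWarning
CRITICAL_PGS = 300  # limit stated for CephOSDPgNumTooHighCritical


def pg_count_alerts(pgs_per_osd):
    """Map each OSD that exceeds a limit to its most severe alert name."""
    alerts = {}
    for osd, count in pgs_per_osd.items():
        if count > CRITICAL_PGS:
            alerts[osd] = "CephOSDPgNumTooHighCritical"
        elif count > WARNING_PGS:
            alerts[osd] = "CephOSDPgNumTooHighWarning"
    return alerts


# Illustrative counts only; OSD names and values are made up.
print(pg_count_alerts({"osd.0": 150, "osd.1": 250, "osd.2": 320}))
```

OSDs at or below 200 PGs are omitted from the result, so an empty dictionary means that neither alert applies.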


CephMonHighNumberOfLeaderChanges

Severity

Warning

Summary

Many leader changes occur in the storage cluster.

Description

{{ $value }} leader changes per minute occur for the {{ $labels.instance }} instance of the {{ $labels.job }} Ceph Monitor.


CephNodeDown

Severity

Critical

Summary

Ceph node {{ $labels.node }} went down.

Description

The {{ $labels.node }} Ceph node is down and requires immediate verification.


CephDataRecoveryTakingTooLong

Severity

Warning

Summary

Data recovery is slow.

Description

Data recovery has been active for more than two hours.


CephPGRepairTakingTooLong

Severity

Warning

Summary

Self-heal issues detected.

Description

The self-heal operations are taking an excessive amount of time.


CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph OSD components are running.


CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph Monitor components are running.
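
Both version mismatch alerts report the number of distinct versions running for a component type. The Python sketch below illustrates that count under the assumption that a mismatch means more than one distinct version. The count_versions function and the daemon-to-version dictionary are hypothetical and do not reflect how the alert expression is actually written.

```python
def count_versions(version_by_daemon):
    """Return the number of distinct versions across the given daemons."""
    return len(set(version_by_daemon.values()))


def has_version_mismatch(version_by_daemon):
    """Assumption: a mismatch means more than one distinct version."""
    return count_versions(version_by_daemon) > 1


# Illustrative data only; daemon names and versions are made up.
osd_versions = {"osd.0": "14.2.9", "osd.1": "14.2.9", "osd.2": "14.2.11"}
print(count_versions(osd_versions), has_version_mismatch(osd_versions))
```

The same helper applies to Ceph Monitor daemons, since only the input dictionary differs.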