This section describes the alerts for the Ceph cluster.
Severity | Minor |
---|---|
Summary | The Ceph cluster is in the WARNING state. For details, run ceph -s. |
Raise condition | ceph_health_status == 1 |
Description | Raises according to the status reported by the Ceph cluster. |
Troubleshooting | Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output. |
Tuning | Not required |
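The alerts in this section are evaluated by Prometheus. As a point of reference, the two cluster health conditions above could be expressed as standard Prometheus alerting rules as in the following minimal sketch; the group and rule names are illustrative assumptions, not the exact definitions shipped with the product.

```yaml
# Hypothetical Prometheus alerting rules mirroring the two raise conditions above.
# Group and alert names are illustrative; the shipped definitions may differ.
groups:
  - name: ceph-cluster-health
    rules:
      - alert: CephClusterHealthMinor        # assumed name
        expr: ceph_health_status == 1
        labels:
          severity: minor
        annotations:
          summary: "The Ceph cluster is in the WARNING state. For details, run ceph -s."
      - alert: CephClusterHealthCritical     # assumed name
        expr: ceph_health_status == 2
        labels:
          severity: critical
        annotations:
          summary: "The Ceph cluster is in the CRITICAL state. For details, run ceph -s."
```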
Severity | Critical |
---|---|
Summary | The Ceph cluster is in the CRITICAL state. For details, run ceph -s. |
Raise condition | ceph_health_status == 2 |
Description | Raises according to the status reported by the Ceph cluster. |
Troubleshooting | Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output. |
Tuning | Not required |
Severity | Minor |
---|---|
Summary | {{ $value }}% of Ceph Monitors are down. For details, run ceph -s. |
Raise condition | count(ceph_mon_quorum_status) - sum(ceph_mon_quorum_status) > 0 |
Description | Raises if any of the Ceph Monitors in the Ceph cluster is down. |
Troubleshooting | Inspect the /var/log/ceph/ceph-mon.<hostname>.log logs on the affected cmn node. |
Tuning | Not required |
Severity | Minor |
---|---|
Summary | {{ $value }}% of Ceph OSDs are down. For details, run ceph osd tree. |
Raise condition | count(ceph_osd_up) - sum(ceph_osd_up) > 0 |
Description | Raises if any of the Ceph OSD nodes in the Ceph cluster is down. |
Troubleshooting | Inspect the /var/log/ceph/ceph-osd.<hostname>.log logs on the affected osd node. |
Tuning | Not required |
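Note that the raise condition above yields the number of down Ceph OSDs, while the summary reports a percentage; the same applies to the Ceph Monitor alert earlier in this section. If you want the reported value itself to be a percentage, the condition can be rewritten as in the following sketch, which is an assumption about how the {{ $value }} placeholder could be populated rather than the shipped definition (the alert name is hypothetical).

```yaml
# Hypothetical variant that exposes the share of down OSDs as a percentage.
groups:
  - name: ceph-osd-availability
    rules:
      - alert: CephOsdDownMinor              # assumed name
        expr: >-
          (count(ceph_osd_up) - sum(ceph_osd_up)) / count(ceph_osd_up) * 100 > 0
        labels:
          severity: minor
        annotations:
          summary: "{{ $value }}% of Ceph OSDs are down. For details, run ceph osd tree."
```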
Severity | Warning |
---|---|
Summary | {{ $value }} bytes of the Ceph OSD space (>=75%) is used for 3 minutes. For details, run ceph df. |
Raise condition | ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}} |
Description | Raises when a Ceph OSD used space capacity exceeds the threshold of 75%. |
Troubleshooting | Remove unused data from the Ceph cluster or add more Ceph OSDs to extend the cluster capacity. |
Tuning | Adjust the {{threshold}} parameter in the alert definition if the default of 75% does not suit your deployment. See the override sketch after this table; the same pattern applies to the other threshold-based alerts in this section. |
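In MCP, StackLight alert thresholds are typically overridden through the Reclass cluster model. The following is a minimal sketch of such an override that raises the warning threshold from 75% to 80%; the pillar path and the alert name are assumptions used for illustration only, so verify them against your model before applying anything.

```yaml
# Illustrative Reclass pillar override raising the OSD space warning threshold
# from 75% to 80%. The alert name and the pillar path are assumptions.
parameters:
  prometheus:
    server:
      alert:
        CephOsdSpaceUsageWarning:            # assumed name
          if: >-
            ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.80
```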
Severity | Major |
---|---|
Summary | {{ $value }} bytes of the Ceph OSD space (>=85%) is used for 3 minutes. For details, run ceph df. |
Raise condition | ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}} |
Description | Raises when a Ceph OSD used space capacity exceeds the threshold of 85%. |
Troubleshooting | Remove unused data from the Ceph cluster or add more Ceph OSDs to extend the cluster capacity. |
Tuning | Adjust the {{threshold}} parameter in the alert definition if the default of 85% does not suit your deployment. |
Severity | Warning |
---|---|
Summary | The Ceph {{pool_name}} pool uses 75% of available space for 3 minutes. For details, run ceph df. |
Raise condition | ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}} |
Description | Raises when a Ceph pool used space capacity exceeds the threshold of 75%. |
Troubleshooting | Remove unused data from the affected pool or add more Ceph OSDs to extend the pool capacity. |
Tuning | Should be tuned per pool. Adjust the {{threshold}} parameter in the alert definition for the affected pool if the default of 75% does not suit your deployment. |
Severity | Critical |
---|---|
Summary | The Ceph {{pool_name}} pool uses 85% of available space for 3 minutes. For details, run ceph df. |
Raise condition | ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}} |
Description | Raises when a Ceph pool used space capacity exceeds the threshold of 85%. |
Troubleshooting | Remove unused data from the affected pool or add more Ceph OSDs to extend the pool capacity. |
Tuning | Should be tuned per pool. Adjust the {{threshold}} parameter in the alert definition for the affected pool if the default of 85% does not suit your deployment. |
Severity | Warning |
---|---|
Summary | Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump. |
Raise condition | max(ceph_osd_numpg) > 200 |
Description | Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 200. |
Troubleshooting | When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs. |
Tuning | Adjust the threshold in the alert definition if your deployment intentionally runs more PGs per Ceph OSD. |
Severity | Critical |
---|---|
Summary | Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump. |
Raise condition | max(ceph_osd_numpg) > 300 |
Description | Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 300. |
Troubleshooting | When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs. |
Tuning | Adjust the threshold in the alert definition if your deployment intentionally runs more PGs per Ceph OSD. |
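Both PG alerts are based on the ceph_osd_numpg metric. To see which Ceph OSDs carry the most placement groups without parsing the ceph pg dump output, a small recording rule over the same metric can help; the group and rule names below are hypothetical and shown only as a sketch.

```yaml
# Hypothetical recording rule exposing the ten OSDs with the highest PG count.
groups:
  - name: ceph-pg-distribution
    rules:
      - record: ceph_osd_numpg:top10
        expr: topk(10, ceph_osd_numpg)
```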
Note
Ceph prediction alerts were added starting from the MCP 2019.2.3 update and must be enabled manually. For details, see Enable the Ceph Prometheus plugin.
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly. |
Raise condition | predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > {{osd_iops_limit}} |
Description | Predicts the IOPS consumption per Ceph OSD in a specified time range, 1 week by default. For production environments, configure the IOPS limit after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{osd_iops_limit}} parameters in the alert definition as required. See the expanded example after this table. |
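The prediction alerts rely on the PromQL predict_linear() function, which fits a linear regression over the given range vector and extrapolates it the specified number of seconds ahead. With the default one-week window ({{threshold}} = 7 days) and a hypothetical IOPS limit of 200, the raise condition above expands roughly as in the following sketch; the alert name and the limit value are assumptions used only for illustration.

```yaml
# Expansion of the raise condition with example values substituted:
#   {{threshold}} = 7 (days), {{osd_iops_limit}} = 200 (hypothetical value).
# predict_linear() extrapolates the 7-day trend of the per-OSD IOPS rate
# 7 days (7 * 86400 seconds) into the future and compares it to the limit.
groups:
  - name: ceph-osd-iops-prediction
    rules:
      - alert: CephPredictOsdIOPSthreshold   # assumed name
        expr: >-
          predict_linear(ceph_osd_op:rate5m[7d], 7 * 86400) > 200
        labels:
          severity: minor
        annotations:
          summary: "The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly."
```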
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly. |
Raise condition | predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(ceph_osd_op:rate5m[1d]) * {{ iops_threshold }} |
Description | Predicts the IOPS consumption per Ceph OSD in a specified time range, 1 week by default, relative to the average IOPS over the last day. For production environments, configure the threshold after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{iops_threshold}} parameters in the alert definition as required. |
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The {{$labels.host}} host may run out of available RAM next week. |
Raise condition | predict_linear(mem_free{host=~"cmn.*|rgw.*|osd.*"}[{{threshold}}d], {{threshold}} * 86400) < 0 |
Description | Predicts the exhaustion of the available RAM on Ceph nodes in a defined time range. |
Tuning | Not required |
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard. |
Raise condition | predict_linear(diskio_write_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_write_time:rate5m[1d]) * {{write_latency_threshold}} |
Description | Predicts the responsiveness of the Ceph OSD disks in a specified time range based on the write latency. For production environments, configure the threshold after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{write_latency_threshold}} parameters in the alert definition as required. |
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard. |
Raise condition | predict_linear(diskio_read_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_read_time:rate5m[1d]) * {{read_latency_threshold}} |
Description | Predicts the responsiveness of the Ceph OSD disks in a specified time range based on the read latency. For production environments, configure the threshold after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{read_latency_threshold}} parameters in the alert definition as required. |
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The {{pool_name}} pool may consume more than {{100*space_threshold}}% of the available capacity in 1 week. For details, run ceph df and plan proper actions. |
Raise condition | predict_linear(ceph_pool_bytes_used[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > (ceph_pool_bytes_used + ceph_pool_max_avail) * {{space_threshold}} * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} |
Description | Predicts the exhaustion of all available capacity of a pool in a defined time range, 1 week by default. For production environments, configure the threshold after deployment. |
Tuning | Adjust the {{threshold}} and {{space_threshold}} parameters in the alert definition as required. See the expanded example after this table. |
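In this rule, the pool usage predicted one week ahead is compared against a fraction of the total pool capacity (used plus available bytes). With the default one-week window and a hypothetical space_threshold of 0.75, the condition for an example pool named volumes expands roughly as follows; the alert name, the threshold value, and the pool name are illustrative assumptions.

```yaml
# Expansion of the raise condition with example values substituted:
#   {{threshold}} = 7 (days), {{space_threshold}} = 0.75 (hypothetical),
#   {{pool_name}} = "volumes" (example pool name).
groups:
  - name: ceph-pool-space-prediction
    rules:
      - alert: CephPredictPoolSpace          # assumed name
        expr: >-
          predict_linear(ceph_pool_bytes_used[7d], 7 * 86400)
            * on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"}
          > (ceph_pool_bytes_used + ceph_pool_max_avail) * 0.75
            * on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"}
        labels:
          severity: minor
        annotations:
          summary: "The volumes pool may consume more than 75% of the available capacity in 1 week."
```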
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The IOPS in the {{pool_name}} pool are increasing rapidly. |
Raise condition | predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{ iops_limit }} |
Description | Predicts the IOPS consumption per pool in a specified time range, 1 week by default. For production environments, set the IOPS limit after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{iops_limit}} parameters in the alert definition as required. |
Available starting from the 2019.2.3 maintenance update
Severity | Minor |
---|---|
Summary | The IOPS in the {{pool_name}} pool are increasing rapidly. |
Raise condition | predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > avg_over_time(ceph_pool_ops:rate5m[1d]) * {{ iops_threshold }} |
Description | Predicts the IOPS consumption per pool in a specified time range, 1 week by default, relative to the average IOPS over the last day. For production environments, set the threshold after deployment according to the hardware capabilities. |
Tuning | Adjust the {{threshold}} and {{iops_threshold}} parameters in the alert definition as required. |
Available only in the 2019.2.10 maintenance update
Severity | Critical |
---|---|
Summary | RADOS Gateway outage. |
Raise condition | max(openstack_api_check_status{name=~"radosgw.*"}) == 0 |
Description | Raises if RADOS Gateway is not accessible for all available RADOS Gateway endpoints in the OpenStack service catalog. |
Tuning | Not required |
Available only in the 2019.2.10 maintenance update
Severity | Major |
---|---|
Summary | The {{ $labels.name }} endpoint is not accessible. |
Raise condition | openstack_api_check_status{name=~"radosgw.*"} == 0 |
Description | Raises if RADOS Gateway is not accessible for the {{ $labels.name }} endpoint. |
Tuning | Not required |
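The two RADOS Gateway alerts differ only in aggregation: the Critical alert fires when every RADOS Gateway endpoint in the OpenStack service catalog is down (the maximum over all series is 0), while the Major alert fires per endpoint. A minimal sketch of both rules, with hypothetical group and alert names, could look as follows.

```yaml
# Hypothetical rule definitions for the two RADOS Gateway checks above.
groups:
  - name: ceph-radosgw-availability
    rules:
      - alert: CephRadosGWOutage             # assumed name
        expr: max(openstack_api_check_status{name=~"radosgw.*"}) == 0
        labels:
          severity: critical
        annotations:
          summary: "RADOS Gateway outage."
      - alert: CephRadosGWDown               # assumed name
        expr: openstack_api_check_status{name=~"radosgw.*"} == 0
        labels:
          severity: major
        annotations:
          summary: "The {{ $labels.name }} endpoint is not accessible."
```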