Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity Minor
Summary The Ceph cluster is in the WARNING state. For details, run ceph -s.
Raise condition ceph_health_status == 1
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
Tuning Not required

CephClusterHealthCritical

Severity Critical
Summary The Ceph cluster is in the CRITICAL state. For details, run ceph -s.
Raise condition ceph_health_status == 2
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
Tuning Not required
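
For both of the above alerts, a quick way to identify the reason on any Ceph node is to inspect the cluster status and the individual health checks, for example:

  # Show the overall cluster status
  ceph -s
  # List the health checks that caused the WARNING or ERROR state
  ceph health detail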

CephMonitorDownMinor

Severity Minor
Summary {{ $value }}% of Ceph Monitors are down. For details, run ceph -s.
Raise condition count(ceph_mon_quorum_status) - sum(ceph_mon_quorum_status) > 0
Description Raises if any of the Ceph Monitors in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-mon.<hostname>.log logs on the affected cmn node.
Tuning Not required
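
For example, to see which Ceph Monitor is out of quorum before inspecting its log, run the following on any Ceph node:

  # Show the monitor map and the current quorum members
  ceph mon stat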

CephOsdDownMinor

Severity Minor
Summary {{ $value }}% of Ceph OSDs are down. For details, run ceph osd tree.
Raise condition count(ceph_osd_up) - sum(ceph_osd_up) > 0
Description Raises if any of the Ceph OSDs in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-osd.<hostname>.log logs on the affected osd node.
Tuning Not required
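
For example, to list the Ceph OSDs that are currently down before inspecting the logs on the affected osd node:

  # List the OSD tree and filter the OSDs reported as down
  ceph osd tree | grep -w down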

CephOsdSpaceUsageWarning

Severity Warning
Summary {{ $value }} bytes of the Ceph OSD space (>=75%) is used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 75%.
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
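
  Before choosing one of the above actions, you can verify the current space usage on any Ceph node, for example:

    # Overall and per-pool usage
    ceph df detail
    # Per-OSD usage and PG count
    ceph osd df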
Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageWarning:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdSpaceUsageMajor

Severity Major
Summary {{ $value }} bytes of the Ceph OSD space (>=85%) is used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
Tuning

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageMajor:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageWarning

Severity Warning
Summary The Ceph {{pool_name}} pool uses 75% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 75%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
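
  To check the current per-pool utilization, run ceph df on any Ceph node, or evaluate a query similar to the raise condition for all pools in the Prometheus web UI, for example:

    ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)
      * on(pool_id) group_left(name) ceph_pool_metadata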
Tuning

The threshold should be tuned per pool. For example, to change the threshold to 80% for the volumes pool:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageWarning:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageCritical

Severity Critical
Summary The Ceph {{pool_name}} pool uses 85% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
Tuning

The threshold should be tuned per pool. For example, to change the threshold to 90% for the volumes pool:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageCritical:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighWarning

Severity Warning
Summary Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 200
Description Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 200.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
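
For example, to see the current number of PGs per Ceph OSD (the PGS column), run the following on any Ceph node:

  ceph osd df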
Tuning

For example, to change the threshold to 400 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighWarning:
              if: >-
                max(ceph_osd_numpg) > 400
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighCritical

Severity Critical
Summary Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 300
Description Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 300.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
Tuning

For example, to change the threshold to 500 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighCritical:
              if: >-
                max(ceph_osd_numpg) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Note

Ceph prediction alerts are available starting from the MCP 2019.2.3 maintenance update and must be enabled manually. For details, see Enable the Ceph Prometheus plugin.

CephPredictOsdIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > {{osd_iops_limit}}
Description

Predicts the IOPS consumption per Ceph OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.
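
For illustration, assuming the default one-week range (a threshold of 7 days) and an osd_iops_limit of 200 as in the tuning example below, the rendered raise condition would look similar to:

  predict_linear(ceph_osd_op:rate5m[7d], 7 * 86400) > 200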

Warning

For production environments, configure osd_iops_limit after deployment depending on the hardware used. For example estimates for different hardware types, see IOPS.

Tuning

For example, to change osd_iops_limit to 200:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        osd_iops_limit: 200
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(ceph_osd_op:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.

Warning

For production environments, configure osd_iops_threshold after deployment depending on the current cluster load and estimated limits from CephPredictOsdIOPSthreshold.

Tuning

For example, to change osd_iops_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        osd_iops_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictUsageRAM

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.host}} host may run out of available RAM next week.
Raise condition predict_linear(mem_free{host=~"cmn.*|rgw.*|osd.*"}[{{threshold}}d], {{threshold}} * 86400) < 0
Description Predicts the exhaustion of the available RAM on Ceph nodes in a defined time range.
Tuning Not required

CephPredictOsdWriteLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_write_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_write_time:rate5m[1d]) * {{write_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the write latency. The threshold parameter defines the time range. The write_latency_threshold parameter defines the differences to detect in the write latency.

Warning

For production environments, configure write_latency_threshold after deployment.

Tuning

For example, to change write_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        write_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdReadLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_read_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_read_time:rate5m[1d]) * {{read_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the read latency. The threshold parameter defines the time range. The read_latency_threshold parameter defines the differences to detect in the read latency.

Warning

For production environments, configure read_latency_threshold after deployment.

Tuning

For example, to change read_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        read_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolSpace

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{pool_name}} pool may consume more than {{100*space_threshold}}% of the available capacity in 1 week. For details, run ceph df and plan proper actions.
Raise condition predict_linear(ceph_pool_bytes_used[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > (ceph_pool_bytes_used + ceph_pool_max_avail) * {{space_threshold}} * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"}
Description

Predicts the exhaustion of all available capacity of a pool in a defined time range. The threshold parameter specifies the time range to use. The space_threshold parameter defines the capacity threshold, similar to the one set in CephPool{pool_name}SpaceUsageCritical.

Warning

For production environments, configure space_threshold after deployment.

Tuning

For example, to change space_threshold to 85:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        space_threshold: 85
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{ iops_limit }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_limit to osd_iops_limit from CephPredictOsdIOPSthreshold multiplied by the number of OSDs serving this pool. For example, with osd_iops_limit set to 200 and 10 Ceph OSDs serving the pool, set pool_iops_limit to 2000.

Tuning

For example, to change pool_iops_limit to 2000:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_limit: 2000
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > avg_over_time(ceph_pool_ops:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_threshold to osd_iops_threshold from CephPredictOsdIOPSauto multiplied by the number of OSDs connected to each pool.

Tuning

For example, to change pool_iops_threshold to 3:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_threshold: 3
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

RadosGWOutage

Available only in the 2019.2.10 maintenance update

Severity Critical
Summary RADOS Gateway outage.
Raise condition max(openstack_api_check_status{name=~"radosgw.*"}) == 0
Description Raises if RADOS Gateway is not accessible for all available RADOS Gateway endpoints in the OpenStack service catalog.
Tuning Not required

RadosGWDown

Available only in the 2019.2.10 maintenance update

Severity Major
Summary The {{ $labels.name }} endpoint is not accessible.
Raise condition openstack_api_check_status{name=~"radosgw.*"} == 0
Description Raises if RADOS Gateway is not accessible for the {{ $labels.name }} endpoint.
Tuning Not required