Ceph
This section describes the alerts for the Ceph cluster.
CephClusterHealthMinor
Severity |
Minor |
Summary |
The Ceph cluster is in the WARNING state. For details, run
ceph -s . |
Raise condition |
ceph_health_status == 1
|
Description |
Raises according to the status reported by the Ceph cluster. |
Troubleshooting |
Run the ceph -s command on any Ceph node to identify the reason and
resolve the issue depending on the output. |
Tuning |
Not required |
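For illustration only, an abridged ceph -s output for a cluster in the WARNING state may look as follows; the exact health messages depend on the Ceph release and on the actual issue:
ceph -s
  cluster:
    id:     <cluster-uuid>
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 1024/30720 objects degraded (3.333%)
  services:
    mon: 3 daemons, quorum cmn01,cmn02,cmn03
    osd: 12 osds: 11 up, 12 in
The messages under health: indicate the reason for the WARNING or CRITICAL state and point to the component that requires attention.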
CephClusterHealthCritical
Severity |
Critical |
Summary |
The Ceph cluster is in the CRITICAL state. For details, run
ceph -s . |
Raise condition |
ceph_health_status == 2
|
Description |
Raises according to the status reported by the Ceph cluster. |
Troubleshooting |
Run the ceph -s command on any Ceph node to identify the reason and
resolve the issue depending on the output. |
Tuning |
Not required |
CephMonitorDownMinor
Severity |
Minor |
Summary |
{{ $value }}% of Ceph Monitors are down. For details, run
ceph -s .
|
Raise condition |
count(ceph_mon_quorum_status) - sum(ceph_mon_quorum_status) > 0
|
Description |
Raises if any of the Ceph Monitors in the Ceph cluster is down. |
Troubleshooting |
Inspect the /var/log/ceph/ceph-mon.<hostname>.log logs on the
affected cmn node. |
Tuning |
Not required |
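For example, to identify which Monitor dropped out of quorum before inspecting the logs, run the following commands on any cmn node (the <hostname> placeholder is the short host name of the affected Monitor):
# List the Monitors currently in quorum:
ceph mon stat
# Detailed quorum view in JSON format:
ceph quorum_status
# Inspect the log of the affected Monitor on its cmn node:
tail -n 100 /var/log/ceph/ceph-mon.<hostname>.log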
CephOsdDownMinor
Severity |
Minor |
Summary |
{{ $value }}% of Ceph OSDs are down. For details, run
ceph osd tree .
|
Raise condition |
count(ceph_osd_up) - sum(ceph_osd_up) > 0
|
Description |
Raises if any of the Ceph OSD nodes in the Ceph cluster is down. |
Troubleshooting |
Inspect the /var/log/ceph/ceph-osd.<hostname>.log logs on the
affected osd node. |
Tuning |
Not required |
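For example, to identify the affected OSDs and check the daemon state before reading the logs (the OSD ID below is a placeholder):
# List the OSDs that are reported as down:
ceph osd tree | grep -w down
# On the affected osd node, check the daemon status of a specific OSD ID:
systemctl status ceph-osd@<osd_id>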
CephOsdSpaceUsageWarning
Severity |
Warning |
Summary |
{{ $value }} bytes of the Ceph OSD space (>=75%) are used for 3
minutes. For details, run ceph df .
|
Raise condition |
ceph_cluster_total_used_bytes > ceph_cluster_total_bytes *
{{threshold}}
|
Description |
Raises when a Ceph OSD used space capacity exceeds the threshold of
75%. |
Troubleshooting |
Remove unused data from the Ceph cluster.
Add more Ceph OSDs to the Ceph cluster.
Adjust the warning threshold (use with caution).
|
Tuning |
For example, to change the threshold to 80% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephOsdSpaceUsageWarning:
          if: >-
            ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.8
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
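The cluster-wide usage that this alert evaluates corresponds to the GLOBAL section of the ceph df output. An abridged, illustrative example:
ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    60.0TiB     13.2TiB     46.8TiB      78.00
If %RAW USED exceeds the alert threshold, free up space or add Ceph OSDs as described in the Troubleshooting section.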
CephOsdSpaceUsageMajor
Severity |
Major |
Summary |
{{ $value }} bytes of the Ceph OSD space (>=85%) are used for 3
minutes. For details, run ceph df .
|
Raise condition |
ceph_cluster_total_used_bytes > ceph_cluster_total_bytes *
{{threshold}}
|
Description |
Raises when a Ceph OSD used space capacity exceeds the threshold of
85%. |
Troubleshooting |
Remove unused data from the Ceph cluster.
Add more Ceph OSDs to the Ceph cluster.
Adjust the warning threshold (use with caution).
|
Tuning |
For example, to change the threshold to 95% :
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephOsdSpaceUsageMajor:
          if: >-
            ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.95
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
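The current raw usage ratio can also be verified directly in the Prometheus web UI using the same metrics as the raise condition, for example:
ceph_cluster_total_used_bytes / ceph_cluster_total_bytes
A result above 0.85 means that the condition of this alert is met.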
CephPool{pool_name}SpaceUsageWarning
Severity |
Warning |
Summary |
The Ceph {{pool_name}} pool uses 75% of available space for 3
minutes. For details, run ceph df . |
Raise condition |
ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} >
{{threshold}}
|
Description |
Raises when a Ceph pool used space capacity exceeds the threshold of
75%. |
Troubleshooting |
|
Tuning |
Should be tuned per pool. For example, to change the threshold to
80% for pool volumes:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephPoolvolumesSpaceUsageWarning:
          if: >-
            ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
            on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.8
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
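To review the current usage ratio of all pools at once, evaluate the raise condition in the Prometheus web UI without the pool name filter, for example:
ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)
  * on(pool_id) group_left(name) ceph_pool_metadata
Each resulting series carries the pool name label, which makes it easy to spot the pools approaching the 0.75 threshold.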
CephPool{pool_name}SpaceUsageCritical
Severity |
Critical |
Summary |
The Ceph {{pool_name}} pool uses 85% of available space for 3
minutes. For details, run ceph df . |
Raise condition |
ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} >
{{threshold}}
|
Description |
Raises when a Ceph pool used space capacity exceeds the threshold of
85%. |
Troubleshooting |
|
Tuning |
Should be tuned per pool. For example, to change the threshold to
90% for pool volumes:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephPoolvolumesSpaceUsageCritical:
          if: >-
            ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
            on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.9
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
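The per-pool view of ceph df provides the same information from the Ceph side. An abridged, illustrative example for the volumes pool:
ceph df
POOLS:
    NAME        ID     USED       %USED     MAX AVAIL     OBJECTS
    volumes     1      870GiB     87.00     130GiB        307200
A %USED value above 85 corresponds to the raise condition of this alert.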
CephOsdPgNumTooHighWarning
Severity |
Warning |
Summary |
Some Ceph OSDs contain more than 200 PGs. This may have a negative
impact on the cluster performance. For details, run ceph pg dump . |
Raise condition |
max(ceph_osd_numpg) > 200
|
Description |
Raises when the number of PGs on Ceph OSDs is higher than the default
threshold of 200. |
Troubleshooting |
When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to
400 PGs if SSD disks are used. For a majority of deployments that use
modern hardware, it is safe to keep approximately 300 PGs. |
Tuning |
For example, to change the threshold to 400 PGs:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephOsdPgNumTooHighWarning:
          if: >-
            max(ceph_osd_numpg) > 400
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
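To see how the PGs are distributed across the OSDs, either query the metric from the raise condition in the Prometheus web UI, for example:
topk(10, ceph_osd_numpg)
or run ceph osd df on any Ceph node and review the PGS column, which reports the number of PGs placed on each OSD.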
CephOsdPgNumTooHighCritical
Severity |
Critical |
Summary |
Some Ceph OSDs contain more than 300 PGs. This may have a negative
impact on the cluster performance. For details, run ceph pg dump . |
Raise condition |
max(ceph_osd_numpg) > 300
|
Description |
Raises when the number of PGs on Ceph OSDs is higher than the default
threshold of 300. |
Troubleshooting |
When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to
400 PGs if SSD disks are used. For a majority of deployments that use
modern hardware, it is safe to keep approximately 300 PGs. |
Tuning |
For example, to change the threshold to 500 PGs:
On the cluster level of the Reclass model, create a common file for
all alert customizations. Skip this step to use an existing defined
file.
Create a file for alert customizations:
touch cluster/<cluster_name>/stacklight/custom/alerts.yml
Define the new file in
cluster/<cluster_name>/stacklight/server.yml :
classes:
- cluster.<cluster_name>.stacklight.custom.alerts
...
In the defined alert customizations file, modify the alert threshold
by overriding the if parameter:
parameters:
  prometheus:
    server:
      alert:
        CephOsdPgNumTooHighCritical:
          if: >-
            max(ceph_osd_numpg) > 500
From the Salt Master node, apply the changes:
salt 'I@prometheus:server' state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
Note
Ceph prediction alerts have been added starting from the MCP 2019.2.3
update and should be enabled manually. For details, see
Enable the Ceph Prometheus plugin.
CephPredictOsdIOPSthreshold
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing
rapidly. |
Raise condition |
predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} *
86400) > {{osd_iops_limit}}
|
Description |
Predicts the IOPS consumption per Ceph OSD in a specified time range, 1
week by default. The threshold parameter defines the time range.
Warning
For production environments, configure osd_iops_limit
after deployment depending on the hardware used. For example
estimates for different hardware types, see
IOPS.
|
Tuning |
For example, to change osd_iops_limit to 200 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    osd_iops_limit: 200
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
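With the default one-week prediction window and the osd_iops_limit of 200 from the tuning example above, the rendered raise condition that Prometheus evaluates looks as follows:
predict_linear(ceph_osd_op:rate5m[7d], 7 * 86400) > 200
The expression extrapolates the 5-minute IOPS rate of each Ceph OSD one week ahead and fires if the predicted value exceeds the configured limit.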
CephPredictOsdIOPSauto
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing
rapidly. |
Raise condition |
predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} *
86400) > avg_over_time(ceph_osd_op:rate5m[1d]) * {{ iops_threshold }}
|
Description |
Predicts the IOPS consumption per OSD in a specified time range, 1 week
by default. The threshold parameter defines the time range.
Warning
For production environments, configure
osd_iops_threshold after deployment depending on the current
cluster load and estimated limits from
CephPredictOsdIOPSthreshold .
|
Tuning |
For example, to change osd_iops_threshold to 2 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    osd_iops_threshold: 2
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CephPredictUsageRAM
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The {{$labels.host}} host may run out of available RAM next week. |
Raise condition |
predict_linear(mem_free{host=~"cmn.*|rgw.*|osd.*"}[{{threshold}}d],
{{threshold}} * 86400) < 0
|
Description |
Predicts the exhaustion of the available RAM on Ceph nodes in a defined
time range. |
Tuning |
Not required |
CephPredictOsdWriteLatency
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The {{$labels.name}} disk on the {{$labels.host}} host may become
unresponsive shortly. Verify the top-loaded OSDs on the
Ceph OSD Overview Grafana dashboard. |
Raise condition |
predict_linear(diskio_write_time:rate5m
{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} *
86400) > avg_over_time(diskio_write_time:rate5m[1d]) *
{{write_latency_threshold}}
|
Description |
Predicts the responsiveness of the OSD disks in a specified time range
based on the write latency. The threshold parameter defines the time
range. The write_latency_threshold parameter defines the increase in
write latency to detect, as a multiplier of the one-day average.
Warning
For production environments, configure
write_latency_threshold after deployment.
|
Tuning |
For example, to change write_latency_threshold to 2 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    write_latency_threshold: 2
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CephPredictOsdReadLatency
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The {{$labels.name}} disk on the {{$labels.host}} host may become
unresponsive shortly. Verify the top-loaded OSDs on the
Ceph OSD Overview Grafana dashboard. |
Raise condition |
predict_linear(diskio_read_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}
[{{threshold}}d], {{threshold}} * 86400) >
avg_over_time(diskio_read_time:rate5m[1d]) * {{read_latency_threshold}}
|
Description |
Predicts the responsiveness of the OSD disks in a specified time range
based on the read latency. The threshold parameter defines the time
range. The read_latency_threshold parameter defines the increase in
read latency to detect, as a multiplier of the one-day average.
Warning
For production environments, configure read_latency_threshold
after deployment.
|
Tuning |
For example, to change read_latency_threshold to 2 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    read_latency_threshold: 2
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CephPredictPoolSpace
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The {{pool_name}} pool may consume more than
{{100*space_threshold}}% of the available capacity in 1 week. For
details, run ceph df and plan proper actions. |
Raise condition |
predict_linear(ceph_pool_bytes_used[{{threshold}}d], {{threshold}} *
86400) * on(pool_id) group_left(name)
ceph_pool_metadata{name="{{pool_name}}"} > (ceph_pool_bytes_used +
ceph_pool_max_avail) * {{space_threshold}} * on(pool_id)
group_left(name) ceph_pool_metadata{name="{{pool_name}}"}
|
Description |
Predicts the exhaustion of all available capacity of a pool in a defined
time range. The threshold parameter specifies the time range to use.
The space_threshold parameter defines the capacity threshold,
similar to the one set in CephPool{pool_name}SpaceUsageCritical .
Warning
For production environments, configure space_threshold after
deployment.
|
Tuning |
For example, to change space_threshold to 85 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    space_threshold: 85
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CephPredictPoolIOPSthreshold
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The IOPS in the {{pool_name}} pool are increasing rapidly. |
Raise condition |
predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} *
86400) * on(pool_id) group_left(name)
ceph_pool_metadata{name="{{pool_name}}"} > {{ iops_limit }}
|
Description |
Predicts the IOPS consumption per pool in a specified time range, 1 week
by default. The threshold parameter specifies the time range to use.
Warning
For production environments, after deployment, set
pool_iops_limit to osd_iops_limit from
CephPredictOsdIOPSthreshold multiplied by the number of OSDs for
this pool.
|
Tuning |
For example, to change pool_iops_limit to 2000 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    pool_iops_limit: 2000
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
CephPredictPoolIOPSauto
Available starting from the 2019.2.3 maintenance update
Severity |
Minor |
Summary |
The IOPS in the {{pool_name}} pool are increasing rapidly. |
Raise condition |
predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} *
86400) * on(pool_id) group_left(name)
ceph_pool_metadata{name="{{pool_name}}"} >
avg_over_time(ceph_pool_ops:rate5m[1d]) * {{ iops_threshold }}
|
Description |
Predicts the IOPS utilization per pool in a specified time range, 1 week
by default. The threshold parameter specifies the time range to use.
Warning
For production environments, after deployment, set
pool_iops_threshold to iops_limit from
CephPredictOsdIOPSauto multiplied by the number of OSDs connected
to each pool.
|
Tuning |
For example, to change pool_iops_threshold to 3 :
On the cluster level of the Reclass model in the
cluster/<cluster_name>/ceph/common.yml file, add:
parameters:
  _param:
    pool_iops_threshold: 3
From the Salt Master node, apply the changes:
salt "I@prometheus:server" state.sls prometheus.server
Verify the updated alert definition in the Prometheus web UI.
|
RadosGWOutage
Available only in the 2019.2.10 maintenance update
Severity |
Critical |
Summary |
RADOS Gateway outage. |
Raise condition |
max(openstack_api_check_status{name=~"radosgw.*"}) == 0
|
Description |
Raises if RADOS Gateway is not accessible for all available RADOS
Gateway endpoints in the OpenStack service catalog. |
Tuning |
Not required |
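To narrow down the outage, list the RADOS Gateway endpoints registered in the OpenStack service catalog and probe them manually. The commands below are an illustrative sketch: the service is typically registered with the object-store type, and the endpoint URL and port depend on the deployment.
# List the object-store endpoints from the Keystone catalog:
openstack endpoint list --service object-store
# Probe an endpoint directly:
curl -i http://<radosgw_address>:8080/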
RadosGWDown
Available only in the 2019.2.10 maintenance update
Severity |
Major |
Summary |
The {{ $labels.name }} endpoint is not accessible. |
Raise condition |
openstack_api_check_status{name=~"radosgw.*"} == 0
|
Description |
Raises if RADOS Gateway is not accessible for the {{ $labels.name }}
endpoint. |
Tuning |
Not required |