Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity Minor
Summary The Ceph cluster is in the WARNING state. For details, run ceph -s.
Raise condition ceph_health_status == 1
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
Tuning Not required

CephClusterHealthCritical

Severity Critical
Summary The Ceph cluster is in the CRITICAL state. For details, run ceph -s.
Raise condition ceph_health_status == 2
Description Raises according to the status reported by the Ceph cluster.
Troubleshooting Run the ceph -s command on any Ceph node to identify the reason and resolve the issue depending on the output.
Tuning Not required
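
For both of the above alerts, a quick way to identify the reason on any Ceph node is to inspect the cluster status and the individual health checks, for example:

  # Show the overall cluster status
  ceph -s
  # List the health checks that caused the WARNING or ERROR state
  ceph health detail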

CephMonitorDownMinor

Severity Minor
Summary {{ $value }}% of Ceph Monitors are down. For details, run ceph -s.
Raise condition count(ceph_mon_quorum_status) - sum(ceph_mon_quorum_status) > 0
Description Raises if any of the Ceph Monitors in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-mon.<hostname>.log logs on the affected cmn node.
Tuning Not required
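
For example, to see which Ceph Monitor is out of quorum before inspecting its log, run the following on any Ceph node:

  # Show the monitor map and the current quorum members
  ceph mon stat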

CephOsdDownMinor

Severity Minor
Summary {{ $value }}% of Ceph OSDs are down. For details, run ceph osd tree.
Raise condition count(ceph_osd_up) - sum(ceph_osd_up) > 0
Description Raises if any of the Ceph OSDs in the Ceph cluster is down.
Troubleshooting Inspect the /var/log/ceph/ceph-osd.<hostname>.log logs on the affected osd node.
Tuning Not required
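
For example, to list the Ceph OSDs that are currently down before inspecting the logs on the affected osd node:

  # List the OSD tree and filter the OSDs reported as down
  ceph osd tree | grep -w down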

CephOsdSpaceUsageWarning

Severity Warning
Summary {{ $value }} bytes of the Ceph OSD space (>=75%) is used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 75%.
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
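
  Before choosing one of the above actions, you can verify the current space usage on any Ceph node, for example:

    # Overall and per-pool usage
    ceph df detail
    # Per-OSD usage and PG count
    ceph osd df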
Tuning

For example, to change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageWarning:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdSpaceUsageMajor

Severity Major
Summary {{ $value }} bytes of the Ceph OSD space (>=85%) is used for 3 minutes. For details, run ceph df.
Raise condition ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * {{threshold}}
Description Raises when a Ceph OSD used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Remove unused data from the Ceph cluster.
  • Add more Ceph OSDs to the Ceph cluster.
  • Adjust the warning threshold (use with caution).
Tuning

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdSpaceUsageMajor:
              if: >-
                ceph_cluster_total_used_bytes > ceph_cluster_total_bytes * 0.95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageWarning

Severity Warning
Summary The Ceph {{pool_name}} pool uses 75% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 75%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
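
  To check the current per-pool utilization, run ceph df on any Ceph node, or evaluate a query similar to the raise condition for all pools in the Prometheus web UI, for example:

    ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail)
      * on(pool_id) group_left(name) ceph_pool_metadata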
Tuning

The threshold should be tuned per pool. For example, to change the threshold to 80% for the volumes pool:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageWarning:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephPool{pool_name}SpaceUsageCritical

Severity Critical
Summary The Ceph {{pool_name}} pool uses 85% of available space for 3 minutes. For details, run ceph df.
Raise condition ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{threshold}}
Description Raises when a Ceph pool used space capacity exceeds the threshold of 85%.
Troubleshooting
  • Add more Ceph OSDs to the Ceph cluster.
  • Temporarily move the affected pool to the less occupied disks of the cluster.
Tuning

The threshold should be tuned per pool. For example, to change the threshold to 90% for the volumes pool:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephPoolvolumesSpaceUsageCritical:
              if: >-
                ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) *
                on(pool_id) group_left(name) ceph_pool_metadata{name="volumes"} > 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighWarning

Severity Warning
Summary Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 200
Description Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 200.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
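
For example, to see the current number of PGs per Ceph OSD (the PGS column), run the following on any Ceph node:

  ceph osd df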
Tuning

For example, to change the threshold to 400 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighWarning:
              if: >-
                max(ceph_osd_numpg) > 400
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

CephOsdPgNumTooHighCritical

Severity Critical
Summary Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.
Raise condition max(ceph_osd_numpg) > 300
Description Raises when the number of PGs on Ceph OSDs is higher than the default threshold of 300.
Troubleshooting When designing a Ceph cluster, keep 100-300 PGs per Ceph OSD and up to 400 PGs if SSD disks are used. For a majority of deployments that use modern hardware, it is safe to keep approximately 300 PGs.
Tuning

For example, to change the threshold to 500 PGs:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            CephOsdPgNumTooHighCritical:
              if: >-
                max(ceph_osd_numpg) > 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

Note

Ceph prediction alerts are available starting from the MCP 2019.2.3 maintenance update and must be enabled manually. For details, see Enable the Ceph Prometheus plugin.

CephPredictOsdIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > {{osd_iops_limit}}
Description

Predicts the IOPS consumption per Ceph OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.
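
For illustration, assuming the default one-week range (a threshold of 7 days) and an osd_iops_limit of 200 as in the tuning example below, the rendered raise condition would look similar to:

  predict_linear(ceph_osd_op:rate5m[7d], 7 * 86400) > 200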

Warning

For production environments, configure osd_iops_limit after deployment depending on the hardware used. For example estimates for different hardware types, see IOPS.

Tuning

For example, to change osd_iops_limit to 200:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        osd_iops_limit: 200
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS on the {{ $labels.ceph_daemon }} Ceph OSD are increasing rapidly.
Raise condition predict_linear(ceph_osd_op:rate5m[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(ceph_osd_op:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per OSD in a specified time range, 1 week by default. The threshold parameter defines the time range.

Warning

For production environments, configure osd_iops_threshold after deployment depending on the current cluster load and estimated limits from CephPredictOsdIOPSthreshold.

Tuning

For example, to change osd_iops_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        osd_iops_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictUsageRAM

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.host}} host may run out of available RAM next week.
Raise condition predict_linear(mem_free{host=~"cmn.*|rgw.*|osd.*"}[{{threshold}}d], {{threshold}} * 86400) < 0
Description Predicts the exhaustion of the available RAM on Ceph nodes in a defined time range.
Tuning Not required

CephPredictOsdWriteLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_write_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_write_time:rate5m[1d]) * {{write_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the write latency. The threshold parameter defines the time range. The write_latency_threshold parameter defines the differences to detect in the write latency.

Warning

For production environments, configure write_latency_threshold after deployment.

Tuning

For example, to change write_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        write_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictOsdReadLatency

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{$labels.name}} on the {{$labels.host}} host may become unresponsive shortly. Verify the OSDs top load on the Ceph OSD Overview Grafana dashboard.
Raise condition predict_linear(diskio_read_time:rate5m{host=~"osd.*",name=~"sd[b-z]*"}[{{threshold}}d], {{threshold}} * 86400) > avg_over_time(diskio_read_time:rate5m[1d]) * {{read_latency_threshold}}
Description

Predicts the OSD disks responsiveness in a specified time range based on the read latency. The threshold parameter defines the time range. The read_latency_threshold parameter defines the differences to detect in the read latency.

Warning

For production environments, configure read_latency_threshold after deployment.

Tuning

For example, to change read_latency_threshold to 2:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        read_latency_threshold: 2
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolSpace

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The {{pool_name}} pool may consume more than {{100*space_threshold}}% of the available capacity in 1 week. For details, run ceph df and plan proper actions.
Raise condition predict_linear(ceph_pool_bytes_used[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > (ceph_pool_bytes_used + ceph_pool_max_avail) * {{space_threshold}} * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"}
Description

Predicts the exhaustion of all available capacity of a pool in a defined time range. The threshold parameter specifies the time range to use. The space_threshold parameter defines the capacity threshold, similar to the one set in CephPool{pool_name}SpaceUsageCritical.

Warning

For production environments, configure space_threshold after deployment.

Tuning

For example, to change space_threshold to 85:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        space_threshold: 85
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSthreshold

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > {{ iops_limit }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_limit to osd_iops_limit from CephPredictOsdIOPSthreshold multiplied by the number of OSDs serving this pool. For example, with osd_iops_limit set to 200 and 10 Ceph OSDs serving the pool, set pool_iops_limit to 2000.

Tuning

For example, to change pool_iops_limit to 2000:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_limit: 2000
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

CephPredictPoolIOPSauto

Available starting from the 2019.2.3 maintenance update

Severity Minor
Summary The IOPS in the {{pool_name}} are increasing rapidly.
Raise condition predict_linear(ceph_pool_ops:rate5m[{{threshold}}d], {{threshold}} * 86400) * on(pool_id) group_left(name) ceph_pool_metadata{name="{{pool_name}}"} > avg_over_time(ceph_pool_ops:rate5m[1d]) * {{ iops_threshold }}
Description

Predicts the IOPS consumption per pool in a specified time range, 1 week by default. The threshold parameter specifies the time range to use.

Warning

For production environments, after deployment, set pool_iops_threshold to osd_iops_threshold from CephPredictOsdIOPSauto multiplied by the number of OSDs connected to each pool.

Tuning

For example, to change pool_iops_threshold to 3:

  1. On the cluster level of the Reclass model in the cluster/<cluster_name>/ceph/common.yml file, add:

    parameters:
      _param:
        pool_iops_threshold: 3
    
  2. From the Salt Master node, apply the changes:

    salt "I@prometheus:server" state.sls prometheus.server
    
  3. Verify the updated alert definition in the Prometheus web UI.

RadosGWOutage

Available only in the 2019.2.10 maintenance update

Severity Critical
Summary RADOS Gateway outage.
Raise condition max(openstack_api_check_status{name=~"radosgw.*"}) == 0
Description Raises if RADOS Gateway is not accessible for all available RADOS Gateway endpoints in the OpenStack service catalog.
Tuning Not required

RadosGWDown

Available only in the 2019.2.10 maintenance update

Severity Major
Summary The {{ $labels.name }} endpoint is not accessible.
Raise condition openstack_api_check_status{name=~"radosgw.*"} == 0
Description Raises if RADOS Gateway is not accessible for the {{ $labels.name }} endpoint.
Tuning Not required