GlusterFS

This section describes the alerts for the GlusterFS service.


GlusterfsServiceMinor

Severity Minor
Summary The GlusterFS service on the {{ $labels.host }} host is down.
Raise condition procstat_running{process_name="glusterd"} < 1
Description Raises when Telegraf cannot find running glusterd processes on the kvm hosts.
Troubleshooting
  • Verify the GlusterFS status using systemctl status glusterfs-server.
  • Inspect GlusterFS logs in the /var/log/glusterfs/ directory.
Tuning Not required
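
In addition to systemctl, a quick manual check on the affected kvm host can confirm whether the glusterd daemon is running and whether the host still sees its GlusterFS peers. The following is a sketch, assuming shell access to the affected host:

  # Check for a running glusterd process
  pgrep -af glusterd

  # Verify that the host is still part of the trusted storage pool
  gluster peer status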

GlusterfsServiceOutage

Severity Critical
Summary All GlusterFS services are down.
Raise condition glusterfs_up != 1
Description Raises when the Telegraf service cannot connect to or gather metrics from the GlusterFS service, which typically indicates an issue with the GlusterFS service, the Telegraf monitoring_remote_agent service, or the network.
Troubleshooting
  • Inspect the Telegraf monitoring_remote_agent service logs by running docker service logs monitoring_remote_agent on any mon node.
  • Verify the GlusterFS status using systemctl status glusterfs-server.
  • Inspect GlusterFS logs in the /var/log/glusterfs/ directory.
Tuning Not required
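
To narrow down the cause, you can also check the state of the monitoring_remote_agent replicas and query the metric directly from the Prometheus HTTP API. The following is a sketch; <prometheus_address> is a placeholder for the address and port of your Prometheus server endpoint:

  # On any mon node, verify that the monitoring_remote_agent replicas are running
  docker service ps monitoring_remote_agent

  # Query the glusterfs_up metric directly from Prometheus
  curl -s 'http://<prometheus_address>/api/v1/query?query=glusterfs_up'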

GlusterfsInodesUsedMinor

Severity Minor
Summary More than 80% of GlusterFS {{ $labels.volume }} volume inodes are used for 2 minutes.
Raise condition glusterfs_inodes_percent_used >= {{ monitoring.inodes_percent_used_minor_threshold_percent*100 }} and glusterfs_inodes_percent_used < {{ monitoring.inodes_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 80% and less than 90% of available inodes. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Verify the number of objects stored in GlusterFS.
  • If possible, increase the number of inodes.
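
To verify the inode usage of the affected volume directly on a GlusterFS node, you can use the following command. This is a sketch; replace <volume_name> with the value of the volume label from the alert:

  # Show brick-level details for the volume, including the inode count
  # and the number of free inodes per brick
  gluster volume status <volume_name> detail
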
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available inodes on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus web UI, run the raise condition query over a longer time range to define the best threshold.
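
The following query, entered in the Prometheus web UI, shows the highest inode usage per volume over the last seven days and can help to pick a realistic threshold. This is a sketch; adjust the range to the period you want to inspect:

  max_over_time(glusterfs_inodes_percent_used[7d])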

For example, to change the threshold to the 90-95% interval:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsInodesUsedMinor:
              if: >-
                glusterfs_inodes_percent_used >= 90 and
                glusterfs_inodes_percent_used < 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
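
Instead of opening the web UI, you can also confirm that Prometheus loaded the updated rule through its HTTP API. The following is a sketch; <prometheus_address> is a placeholder for the address and port of your Prometheus server endpoint:

  # List the loaded alerting rules and show the definition of the modified alert
  curl -s 'http://<prometheus_address>/api/v1/rules' | python -m json.tool | grep -A 3 GlusterfsInodesUsedMinor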

GlusterfsInodesUsedMajor

Severity Major
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume inodes are used for 2 minutes.
Raise condition glusterfs_inodes_percent_used >= {{ monitoring.inodes_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 90% of available inodes. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Verify the number of objects stored in GlusterFS.
  • If possible, increase the number of inodes.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available inodes on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus web UI, run the raise condition query over a longer time range to define the best threshold.

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsInodesUsedMajor:
              if: >-
                glusterfs_inodes_percent_used >= 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

GlusterfsSpaceUsedMinor

Severity Minor
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume disk space is used for 2 minutes.
Raise condition glusterfs_space_percent_used >= {{ monitoring.space_percent_used_minor_threshold_percent*100 }} and glusterfs_space_percent_used < {{ monitoring.space_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 80% and less than 90% of available space. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Inspect the data stored in GlusterFS.
  • Increase the GlusterFS capacity.
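
To inspect the space usage directly on a node where the volume is mounted, you can use the following commands. This is a sketch; the mount point path is only an example, use the one reported by df:

  # List the mounted GlusterFS volumes with their disk usage
  df -h -t fuse.glusterfs

  # Identify the largest directories on the affected volume
  # (example path, use the mount point reported by df)
  du -sh /mnt/glusterfs/<volume_name>/* | sort -rh | head
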
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available space on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus web UI, run the raise condition query over a longer time range to define the best threshold.

For example, to change the threshold to the 90-95% interval:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsSpaceUsedMinor:
              if: >-
                glusterfs_space_percent_used >= 90 and
                glusterfs_space_percent_used < 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

GlusterfsSpaceUsedMajor

Severity Major
Summary {{ $value }}% of GlusterFS {{ $labels.volume }} volume disk space is used for 2 minutes.
Raise condition glusterfs_space_percent_used >= {{ monitoring.space_percent_used_major_threshold_percent*100 }}
Description Raises when GlusterFS uses more than 90% of available space. The volume label in the raised alert contains the affected GlusterFS volume.
Troubleshooting
  • Inspect the data stored in GlusterFS.
  • Increase the GlusterFS capacity.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available space on the GlusterFS nodes and adjust the threshold accordingly. In the Prometheus web UI, run the raise condition query over a longer time range to define the best threshold.

For example, to change the threshold to 95%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            GlusterfsSpaceUsedMajor:
              if: >-
                glusterfs_space_percent_used >= 95
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.


GlusterfsMountMissing

Available since the 2019.2.11 maintenance update

Severity Major
Summary GlusterFS mount point is not mounted.
Raise condition delta(glusterfs_mount_scrapes:rate5m{fstype=~"(fuse.)?glusterfs"}[5m]) < 0 or glusterfs_mount_scrapes:rate5m{fstype=~"(fuse.)?glusterfs"} == 0
Description Raises when a GlusterFS mount point is not mounted. The path, host, and device labels in the raised alert contain the path, node name, and device of the affected mount point.
Troubleshooting To perform a manual remount, apply the salt '*' state.sls glusterfs.client Salt state from the Salt Master node.
Tuning Not required
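
To verify the state of the mount point before and after applying the Salt state, you can use the following commands on the affected node. This is a sketch, assuming shell access to the node:

  # Check whether the GlusterFS mount point is currently present
  grep glusterfs /proc/mounts

  # After applying the glusterfs.client state, confirm that the volume is mounted again
  df -h -t fuse.glusterfs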