RabbitMQ

This section describes the alerts for the RabbitMQ service.


RabbitmqServiceDown

Severity Critical
Summary The RabbitMQ service on the {{$labels.host}} node is down.
Raise condition rabbitmq_up == 0
Description Raises when the RabbitMQ service is down on one node, which affects RabbitMQ availability. The alert raises 1 minute after the issue occurs. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the RabbitMQ status using systemctl status rabbitmq-server.
  • Inspect the RabbitMQ logs in /var/log/rabbitmq.
  • Verify that the node has enough resources, such as disk space or RAM.
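
A minimal shell sketch of the checks above, to be run on the affected node (the rabbitmq-server unit name and the /var/log/rabbitmq path follow the defaults mentioned in this list; querying the journal for the unit is an assumption):

    # Service state and recent journal entries
    systemctl status rabbitmq-server
    journalctl -u rabbitmq-server --since "1 hour ago" | tail -n 50

    # Most recent RabbitMQ log entries
    tail -n 100 /var/log/rabbitmq/*.log

    # Free disk space and memory on the node
    df -h
    free -m
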
Tuning Not required

RabbitmqServiceOutage

Severity Critical
Summary All RabbitMQ services are down.
Raise condition count(rabbitmq_up == 0) == count(rabbitmq_up)
Description Raises when RabbitMQ is down on all nodes, indicating that the service is unavailable. The alert raises 1 minute after the issue occurs.
Troubleshooting
  • Verify the RabbitMQ status on the msg nodes using systemctl status rabbitmq-server.
  • Inspect the RabbitMQ logs in the /var/log/rabbitmq directory on the msg nodes.
  • Verify that the nodes have enough resources, such as disk space or RAM.
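
Since this alert concerns all RabbitMQ nodes, the same checks can be run from the Salt Master node. A sketch, assuming the I@rabbitmq:server targeting used elsewhere in this guide and the default /var/lib/rabbitmq data directory:

    # Service state on all RabbitMQ nodes at once
    salt -C 'I@rabbitmq:server' cmd.run 'systemctl --no-pager status rabbitmq-server'

    # Cluster status and free resources on the msg nodes
    salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
    salt -C 'I@rabbitmq:server' cmd.run 'df -h /var/lib/rabbitmq; free -m'
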
Tuning Not required

RabbitMQUnequalQueueCritical

Available starting from the 2019.2.5 maintenance update

Severity Critical
Summary The RabbitMQ service has an unequal number of queues across the cluster instances.
Raise condition max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)
Description Raises when the RabbitMQ cluster nodes have an inconsistent number of queues for 10 minutes. This issue can occur after a service restart and can make RabbitMQ inaccessible.
Troubleshooting Contact Mirantis support.
Tuning Not required

RabbitmqDiskFullWarning

Severity Warning
Summary The RabbitMQ service on the {{$labels.host}} node has less than 500 MB of free disk space.
Raise condition rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10
Description Raises when the free disk space available to RabbitMQ drops to 500 MB or lower (by default, 10 multiplied by the 50 MB disk free limit). As the available disk space shrinks, RabbitMQ checks it more frequently, which can increase the system load.
Troubleshooting Free or add more disk space on the affected node.
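
For example, to locate what consumes the space on the affected node (a sketch assuming the default /var/lib/rabbitmq data directory):

    # Disk usage of the partition that holds the RabbitMQ data directory
    df -h /var/lib/rabbitmq
    du -sh /var/lib/rabbitmq/* | sort -h | tail

    # Free disk space and the configured limit as reported by RabbitMQ
    rabbitmqctl status | grep -i -A 1 disk
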
Tuning

To change the threshold multiplier to 15:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqDiskFullWarning:
              if: >-
                rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 15
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqDiskFullCritical

Severity Critical
Summary The RabbitMQ disk space on the {{$labels.host}} node is full.
Raise condition rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit
Description Raises when RabbitMQ has used all of its allowed disk space (by default, less than 50 MB of free space remains). The alert is cluster-wide. RabbitMQ blocks producers and prevents in-memory messages from being paged to disk. Frequent disk checks contribute to load growth. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Add more disk space on the affected node.
Tuning Not required

RabbitmqMemoryLowWarning

Severity Warning
Summary The RabbitMQ service uses more than 80% of memory on the {{$labels.host}} node for 2 minutes.
Raise condition 100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8
Description Raises when the RabbitMQ memory consumption reaches the warning threshold (80% of allocated memory by default). The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Edit the high memory watermark in the service configuration.

  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.
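
Before changing the watermark, it may help to see what consumes the memory on the affected node. A minimal sketch:

    # Memory usage, per-category breakdown, and the configured watermark
    rabbitmqctl status | grep -i -A 20 memory

    # Overall memory pressure on the node
    free -m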

Tuning

To change the watermark:

  1. On the cluster level of the Reclass model, specify the vm_high_watermark parameter in openstack/message_queue.yml. For example:

    rabbitmq:
      server:
        memory:
          vm_high_watermark: 0.8
    
  2. From the Salt Master node, apply the following states one by one:

    salt '*' saltutil.refresh_pillar
    salt -C 'I@rabbitmq:server' state.sls rabbitmq.server
    

RabbitmqMemoryLowCritical

Severity Critical
Summary The RabbitMQ service on the {{$labels.host}} node is out of memory.
Raise condition rabbitmq_node_mem_used >= rabbitmq_node_mem_limit
Description RabbitMQ uses all the allocated memory and blocks all connections that publish messages to prevent further memory growth. The host label in the raised alert contains the host name of the affected node. If other system services consume more RAM, the system may start to swap, which can cause the Erlang VM to crash and bring RabbitMQ down on the node.
Troubleshooting
  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.
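
To confirm that publishers are blocked by the memory alarm, a quick check on the affected node (a sketch):

    # Connections blocked by the memory alarm are reported as "blocked" or "blocking"
    rabbitmqctl list_connections name state | grep -iE 'block' || echo "no blocked connections"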

Tuning Not required

RabbitmqMessagesTooHigh

Severity Warning
Summary The RabbitMQ service on the {{$labels.host}} node has received more than 2^20 messages.
Raise condition rabbitmq_overview_messages > 2^20
Description Raises when the number of messages in RabbitMQ exceeds the warning limit (by default, 1024 multiplied by 1024, or 2^20), typically indicating that some consumers do not consume messages from the queues.
Troubleshooting Verify whether large queues are present using rabbitmqctl list_queues, as shown in the sketch below.
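
For example, a sketch that lists the largest queues together with their consumer counts (a large queue with zero consumers usually points to the misbehaving service):

    # Top 20 queues by message count
    rabbitmqctl list_queues name messages consumers | sort -k 2 -n -r | head -n 20
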
Tuning

For example, to change the threshold to 2^21:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqMessagesTooHigh:
              if: >-
                rabbitmq_overview_messages > 2^21
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsTooHigh

Severity Critical (Major in 2019.2.4)
Summary The rate of errors in RabbitMQ logs is more than 0.2 error messages per second on the {{$labels.host}} node as measured over 5 minutes.
Raise condition
  • In 2019.2.4: sum(rate(log_messages{service="rabbitmq", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2
  • In 2019.2.5: sum(rate(log_messages{service="rabbitmq", level=~"(?i:(error|critical))"}[5m])) without (level) > 0.05
Description Raises when the average per-second rate of error, fatal, or emergency messages (error or critical starting from 2019.2.5) in the RabbitMQ logs on the node exceeds the threshold, as measured over the last 5 minutes. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the log files in the /var/log/rabbitmq directory on the affected node.
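
For example, to get a quick view of the recent error-level messages directly on the affected node (a sketch; the severities follow the 2019.2.4 raise condition above, adjust them for 2019.2.5):

    # Per-file count of error-level messages
    grep -icE 'error|emergency|fatal' /var/log/rabbitmq/*.log

    # The 20 most recent matches
    grep -ihE 'error|emergency|fatal' /var/log/rabbitmq/*.log | tail -n 20
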
Tuning

Typically, you should not change the default value. If the alert fires constantly, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust the threshold to an error rate that is acceptable for a particular environment. In the Prometheus web UI, use the raise condition query to view the rate of a particular message type in the logs over a longer period of time and define the best threshold.
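
A sketch of such a query over a one-hour window, run through the Prometheus HTTP API (the <prometheus_address> placeholder is an assumption and must be replaced with the actual Prometheus endpoint):

    # Average rate of error-level RabbitMQ log messages over the last hour
    curl -sG 'http://<prometheus_address>/api/v1/query' \
      --data-urlencode 'query=sum(rate(log_messages{service="rabbitmq", level=~"(?i:(error|emergency|fatal))"}[1h])) without (level)'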

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service=""rabbitmq"", level=~""(?i:\
                (error|emergency|fatal))""}[5m])) without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsMajor

Available starting from the 2019.2.5 maintenance update

Severity Major
Summary The RabbitMQ logs on the {{ $labels.host }} node contain errors (as measured over the last 30 minutes).
Raise condition sum(increase(log_messages{service="rabbitmq", level=~"(?i:(error|critical))"}[30m])) without (level) > 0
Description Raises when error or critical messages appear in the RabbitMQ logs on a node. The host label in the raised alert contains the name of the affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity.
Troubleshooting Inspect the log files in the /var/log/rabbitmq directory on the affected node.
Tuning Not required

RabbitmqFdUsageWarning

Available starting from the 2019.2.6 maintenance update

Severity Warning
Summary The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.
Raise condition rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70
Description

Raises when the RabbitMQ instance uses 70% of available file descriptors.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.
  • Decrease the rpc_workers value and apply the state for the corresponding service. To check the current file descriptor usage, see the sketch below.
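
A minimal sketch of such a check, run on the affected node (beam.smp is assumed to be the name of the Erlang VM process that runs RabbitMQ):

    # File descriptor usage as reported by RabbitMQ
    rabbitmqctl status | grep -i -A 4 file_descriptors

    # OS-level open files limit of the Erlang VM process
    cat /proc/$(pgrep -f beam.smp | head -n 1)/limits | grep -i 'open files'
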
Tuning

For example, to change the threshold to 60%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageWarning:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqFdUsageCritical

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.
Raise condition rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95
Description

Raises when the RabbitMQ instance uses 95% of available file descriptors, indicating that RabbitMQ is about to crash.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.
  • Decrease the rpc_workers value and apply the state for the corresponding service.
Tuning

For example, to change the threshold to 87%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageCritical:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 87
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.