RabbitMQ

This section describes the alerts for the RabbitMQ service.


RabbitmqServiceDown

Severity

Critical

Summary

The RabbitMQ service on the {{$labels.host}} node is down.

Raise condition

rabbitmq_up == 0

Description

Raises when the RabbitMQ service is down on one node, which affects RabbitMQ availability. The alert is raised 1 minute after the issue occurs. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the RabbitMQ status using systemctl status rabbitmq-server.

  • Inspect the RabbitMQ logs in /var/log/rabbitmq.

  • Verify that the node has enough resources, such as disk space or RAM.
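
A minimal sketch of these checks, assuming the standard rabbitmq-server systemd unit, the default /var/log/rabbitmq log directory, and the default /var/lib/rabbitmq data directory:

  # Service state and recent unit messages
  systemctl status rabbitmq-server
  journalctl -u rabbitmq-server --since "1 hour ago"

  # Recent entries in the RabbitMQ log files
  tail -n 100 /var/log/rabbitmq/*.log

  # Free disk space and memory on the node
  df -h /var/lib/rabbitmq
  free -m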

Tuning

Not required

RabbitmqServiceOutage

Severity

Critical

Summary

All RabbitMQ services are down.

Raise condition

count(rabbitmq_up == 0) == count(rabbitmq_up)

Description

Raises when RabbitMQ is down on all nodes, indicating that the service is unavailable. The alert is raised 1 minute after the issue occurs.

Troubleshooting

  • Verify the RabbitMQ status on the msg nodes using systemctl status rabbitmq-server.

  • Inspect the RabbitMQ logs in the /var/log/rabbitmq directory on the msg nodes.

  • Verify that the node has enough resources such as disk space or RAM.
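
For example, you can check the service and cluster state on all RabbitMQ nodes at once from the Salt Master node. This is a sketch that assumes the rabbitmq:server pillar targets the msg nodes, as in the Salt commands used elsewhere in this section:

  salt -C 'I@rabbitmq:server' cmd.run 'systemctl status rabbitmq-server --no-pager'
  salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'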

Tuning

Not required

RabbitMQUnequalQueueCritical

Available starting from the 2019.2.5 maintenance update

Severity

Critical

Summary

The RabbitMQ service has an unequal number of queues across the cluster instances.

Raise condition

max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)

Description

Raises when the RabbitMQ cluster nodes have an inconsistent number of queues for 10 minutes. This issue can occur after a service restart and can make RabbitMQ inaccessible.
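
To see how the queue counts diverge across the instances, you can run the metric from the raise condition in the Prometheus web UI, for example:

  rabbitmq_overview_queues
  max(rabbitmq_overview_queues) - min(rabbitmq_overview_queues)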

Troubleshooting

Contact Mirantis support.

Tuning

Not required

RabbitmqDiskFullWarning

Severity

Warning

Summary

The RabbitMQ service on the {{$labels.host}} node has less than 500 MB of free disk space.

Raise condition

rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10

Description

Raises when the free disk space available to RabbitMQ drops to 500 MB or less (by default, 10 multiplied by the 50 MB free disk limit). RabbitMQ checks the available disk space more frequently as it shrinks, which can affect the system load.

Troubleshooting

Free or add more disk space on the affected node.
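
For example, to see how much space is left and what consumes it on the affected node (a sketch, assuming the default /var/lib/rabbitmq and /var/log/rabbitmq locations):

  df -h /var/lib/rabbitmq
  du -sh /var/lib/rabbitmq/* /var/log/rabbitmq/* 2>/dev/null | sort -h | tail -20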

Tuning

To change the threshold multiplier to 15 (750 MB with the default 50 MB limit):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqDiskFullWarning:
              if: >-
                rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 15
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqDiskFullCritical

Severity

Critical

Summary

The RabbitMQ disk space on the {{$labels.host}} node is full.

Raise condition

rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit

Description

Raises when RabbitMQ uses all the disk space available to it (less than 50 MB of free space by default). The resulting disk alarm is cluster-wide: RabbitMQ blocks producers and prevents in-memory messages from being paged to disk. Frequent disk checks further contribute to the load growth. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

Add more disk space on the affected node.

Tuning

Not required

RabbitmqMemoryLowWarning

Severity

Warning

Summary

The RabbitMQ service uses more than 80% of memory on the {{$labels.host}} node for 2 minutes.

Raise condition

100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8

Description

Raises when the RabbitMQ memory consumption reaches the warning threshold (80% of allocated memory by default). The host label in the raised alert contains the host name of the affected node.
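
To inspect the current memory usage per node, you can run the expression from the raise condition in the Prometheus web UI:

  100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit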

Troubleshooting

  • Edit the high memory watermark in the service configuration.

  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.

Tuning

To change the watermark:

  1. On the cluster level of the Reclass model, specify the vm_high_watermark parameter in openstack/message_queue.yml. For example:

    rabbitmq:
      server:
        memory:
          vm_high_watermark: 0.8
    
  2. From the Salt Master node, apply the following states one by one:

    salt '*' saltutil.refresh_pillar
    salt -C 'I@rabbitmq:server' state.sls rabbitmq.server
    

RabbitmqMemoryLowCritical

Severity

Critical

Summary

The RabbitMQ service on the {{$labels.host}} node is out of memory.

Raise condition

rabbitmq_node_mem_used >= rabbitmq_node_mem_limit

Description

RabbitMQ uses all the allocated memory and blocks all connections that publish messages to prevent further growth in usage. The host label in the raised alert contains the host name of the affected node. If other system services consume more RAM, the system may start swapping, which can crash the Erlang VM and bring RabbitMQ down on the node.
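
To confirm the broker's own view of its memory usage and the blocked connections on the affected node, you can use rabbitmqctl (a sketch; rabbit@msg01 is an example node name, as in the commands below):

  # Broker status, including the memory breakdown and active resource alarms
  rabbitmqctl -n rabbit@msg01 status

  # Connections blocked or blocking because of the memory alarm
  rabbitmqctl -n rabbit@msg01 list_connections name state | grep -i block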

Troubleshooting

  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.

Tuning

Not required

RabbitmqMessagesTooHigh

Severity

Warning

Summary

The RabbitMQ service on the {{$labels.host}} node has received more than 2^20 messages.

Raise condition

rabbitmq_overview_messages > 2^20

Description

Raises when the number of messages received by RabbitMQ exceeds the warning limit (by default, 1024 multiplied by 1024, that is 2^20), typically indicating that some consumers are not picking up messages from the queues.

Troubleshooting

Verify whether large queues are present using rabbitmqctl list_queues.
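
For example, to list the queues with the most messages and their consumer counts (a sketch; rabbit@msg01 is an example node name):

  rabbitmqctl -n rabbit@msg01 list_queues name messages consumers | sort -k2 -n -r | head -20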

Tuning

For example, to change the message count threshold to 2^21:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqMessagesTooHigh:
              if: >-
                rabbitmq_overview_messages > 2^21
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsTooHigh

Severity

Critical (Major in 2019.2.4)

Summary

The rate of errors in RabbitMQ logs is more than 0.2 error messages per second on the {{$labels.host}} node as measured over 5 minutes.

Raise condition

  • In 2019.2.4: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2

  • In 2019.2.5: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[5m])) without (level) > 0.05

Description

Raises when the average rate of error, fatal, or emergency messages in the RabbitMQ logs on the node is more than 0.2 per second. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity level. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

Inspect the log files in the /var/log/rabbitmq directory on the affected node.

Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust the threshold to an error rate that is acceptable for a particular environment. In the Prometheus web UI, use the raise condition query to view the rate at which a particular message type appears in the logs over a longer period of time and define the best threshold.
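
As a sketch, the following query (the raise condition with a one-hour window instead of five minutes) shows the longer-term error rate and helps to define a realistic threshold:

  sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[1h])) without (level)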

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m]))
                without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsMajor

Available starting from the 2019.2.5 maintenance update

Severity

Major

Summary

The RabbitMQ logs on the {{ $labels.host }} node contain errors (as measured over the last 30 minutes).

Raise condition

sum(increase(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[30m])) without (level) > 0

Description

Raises when error or critical messages appear in the RabbitMQ logs on a node. The host label in the raised alert contains the name of the affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity level.

Troubleshooting

Inspect the log files in the /var/log/rabbitmq directory on the affected node.

Tuning

Not required

RabbitmqFdUsageWarning

Available starting from the 2019.2.6 maintenance update

Severity

Warning

Summary

The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.

Raise condition

rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70

Description

Raises when the RabbitMQ instance uses 70% or more of the available file descriptors.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.

  • Decrease the rpc_workers value and apply the state for the corresponding service, as in the sketch below.
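
A sketch of these steps; the grep path pattern and the nova.controller state are examples only, use the file and state that correspond to the service whose rpc_workers value you change:

  # Look for overridden rpc_workers values in the cluster model
  grep -rn rpc_workers cluster/<cluster_name>/openstack/

  # After lowering the value, re-apply the state of the affected service,
  # for example, for the OpenStack controller nodes running nova:
  salt -C 'I@nova:controller' state.sls nova.controller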

Tuning

For example, to change the threshold to 60%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageWarning:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqFdUsageCritical

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.

Raise condition

rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95

Description

Raises when the RabbitMQ instance uses 95% or more of the available file descriptors, indicating that RabbitMQ is about to crash.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.

  • Decrease the rpc_workers value and apply the state for the corresponding service.
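
To confirm how close the node is to the limit, you can also ask the broker directly (a sketch; rabbit@msg01 is an example node name, and the status output includes a file_descriptors section):

  rabbitmqctl -n rabbit@msg01 status | grep -i -A 4 file_descriptors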

Tuning

For example, to change the threshold to 87%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageCritical:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 87
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.