This section describes the alerts for the RabbitMQ service.
| Severity | Critical |
| --- | --- |
| Summary | The RabbitMQ service on the {{$labels.host}} node is down. |
| Raise condition | `rabbitmq_up == 0` |
| Description | Raises when the RabbitMQ service is down on one node, which affects RabbitMQ availability. The alert raises 1 minute after the issue occurs. The `host` label in the raised alert contains the host name of the affected node. |
| Troubleshooting | |
| Tuning | Not required |
| Severity | Critical |
| --- | --- |
| Summary | All RabbitMQ services are down. |
| Raise condition | `count(rabbitmq_up == 0) == count(rabbitmq_up)` |
| Description | Raises when RabbitMQ is down on all nodes, indicating that the service is unavailable. The alert raises 1 minute after the issue occurs. |
| Troubleshooting | |
| Tuning | Not required |
Available starting from the 2019.2.5 maintenance update
| Severity | Critical |
| --- | --- |
| Summary | The RabbitMQ service has an unequal number of queues across the cluster instances. |
| Raise condition | `max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)` |
| Description | Raises when the RabbitMQ cluster nodes have an inconsistent number of queues for 10 minutes. This issue can occur after a service restart and can make RabbitMQ inaccessible. |
| Troubleshooting | Contact Mirantis support. |
| Tuning | Not required |
| Severity | Warning |
| --- | --- |
| Summary | The RabbitMQ service on the {{$labels.host}} node has less than 500 MB of free disk space. |
| Raise condition | `rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10` |
| Description | Raises when the disk space available to RabbitMQ drops to 500 MB (by default, 10 times the 50 MB free disk limit). RabbitMQ checks the available disk space more frequently as it shrinks, which can increase system load. |
| Troubleshooting | Free or add more disk space on the affected node. |
| Tuning | To change the threshold, override the alert expression with a different multiplier, as shown in the sketch after this table. |
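A minimal sketch of such an override, assuming that StackLight alerts are tuned through the Prometheus server pillar of the Reclass cluster model; the alert name and the exact pillar path are assumptions, so match them to the definitions in your deployment:

```yaml
# Hypothetical pillar override (assumed alert name): raise the warning
# threshold from 10x to 20x the 50 MB free disk limit, about 1 GB.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqDiskLowWarning:
          if: >-
            rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 20
```

After updating the model, reapply the Prometheus server state from the Salt Master node so that the rule is re-rendered and loaded.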
| Severity | Critical |
| --- | --- |
| Summary | The RabbitMQ disk space on the {{$labels.host}} node is full. |
| Raise condition | `rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit` |
| Description | Raises when RabbitMQ uses all the available disk space (less than 50 MB left by default). The alert is cluster-wide. RabbitMQ blocks producers and prevents in-memory messages from paging to the disk. Frequent disk checks contribute to load growth. The `host` label in the raised alert contains the host name of the affected node. |
| Troubleshooting | Add more disk space on the affected node. |
| Tuning | Not required |
| Severity | Warning |
| --- | --- |
| Summary | The RabbitMQ service uses more than 80% of memory on the {{$labels.host}} node for 2 minutes. |
| Raise condition | `100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8` |
| Description | Raises when the RabbitMQ memory consumption reaches the warning threshold (80% of the allocated memory by default). The `host` label in the raised alert contains the host name of the affected node. |
| Troubleshooting | |
| Tuning | To change the watermark, override the alert expression, as shown in the sketch after this table. |
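Following the same assumed pillar-override pattern, a sketch that moves the warning threshold from 80% to 90% of the allocated memory (alert name hypothetical):

```yaml
# Hypothetical override (assumed alert name): fire the warning at 90%
# instead of 80% of the RabbitMQ memory limit.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqMemoryLowWarning:
          if: >-
            100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.9
```

Note that this changes only the alerting threshold; it does not modify the vm_memory_high_watermark setting that RabbitMQ itself enforces.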
| Severity | Critical |
| --- | --- |
| Summary | The RabbitMQ service on the {{$labels.host}} node is out of memory. |
| Raise condition | `rabbitmq_node_mem_used >= rabbitmq_node_mem_limit` |
| Description | Raises when RabbitMQ uses all the allocated memory and blocks all connections that publish messages, to prevent further memory growth. The `host` label in the raised alert contains the host name of the affected node. If other system services consume more RAM, the system may start to swap, which can crash the Erlang VM and bring RabbitMQ down on the node. |
| Troubleshooting | |
| Tuning | Not required |
| Severity | Warning |
| --- | --- |
| Summary | The RabbitMQ service on the {{$labels.host}} node has received more than 2^20 messages. |
| Raise condition | `rabbitmq_overview_messages > 2^20` |
| Description | Raises when the number of messages received by RabbitMQ exceeds the warning limit (by default, 1024 multiplied by 1024), typically indicating that a consumer does not pick messages from the queues. |
| Troubleshooting | Verify whether large queues are present using `rabbitmqctl list_queues`. |
| Tuning | To change the message count threshold, override the alert expression, as shown in the sketch after this table. |
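A sketch of raising the message count threshold, again under the assumed pillar-override pattern (alert name hypothetical):

```yaml
# Hypothetical override (assumed alert name): double the warning limit
# to 2^21 (2097152) messages.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqMessagesTooHigh:
          if: >-
            rabbitmq_overview_messages > 2^21
```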
| Severity | Critical (Major in 2019.2.4) |
| --- | --- |
| Summary | The rate of errors in RabbitMQ logs is more than 0.2 error messages per second on the {{$labels.host}} node as measured over 5 minutes. |
| Raise condition | `sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error\|fatal\|emergency))"}[5m])) without (level) > 0.2` |
| Description | Raises when the average per-second rate of `error`, `fatal`, or `emergency` messages in the RabbitMQ logs on the node exceeds 0.2 per second. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity. The `host` label in the raised alert contains the host name of the affected node. |
| Troubleshooting | Inspect the log files in the /var/log/rabbitmq directory on the affected node. |
| Tuning | Typically, you should not change the default value. If the alert is constantly firing, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust the threshold to an acceptable error rate for a particular environment. In the Prometheus web UI, use the raise condition query to view the appearance rate of a particular message type in the logs over a longer period of time and define the best threshold. For an example of changing the threshold, see the sketch after this table. |
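For example, assuming the same pillar-override pattern (alert name hypothetical), a sketch that tolerates up to 0.5 error messages per second:

```yaml
# Hypothetical override (assumed alert name): raise the acceptable error
# rate from 0.2 to 0.5 messages per second, averaged over 5 minutes.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqErrorLogsTooHigh:
          if: >-
            sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|fatal|emergency))"}[5m])) without (level) > 0.5
```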
Available starting from the 2019.2.5 maintenance update
| Severity | Major |
| --- | --- |
| Summary | The RabbitMQ logs on the {{ $labels.host }} node contain errors (as measured over the last 30 minutes). |
| Raise condition | `sum(increase(log_messages{service="rabbitmq",level=~"(?i:(error\|critical))"}[30m])) without (level) > 0` |
| Description | Raises when `error` or `critical` messages appear in the RabbitMQ logs on a node. The `host` label in the raised alert contains the name of the affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity. |
| Troubleshooting | Inspect the log files in the /var/log/rabbitmq directory on the affected node. |
| Tuning | Not required |
Available starting from the 2019.2.6 maintenance update
| Severity | Warning |
| --- | --- |
| Summary | The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node. |
| Raise condition | `rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70` |
| Description | Raises when the RabbitMQ instance uses 70% of the available file descriptors. Warning: For production environments, configure the alert after deployment. |
| Troubleshooting | |
| Tuning | To change the threshold, override the alert expression, as shown in the sketch after this table. |
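A sketch for this alert under the same pillar-override assumption (alert name hypothetical):

```yaml
# Hypothetical override (assumed alert name): warn at 80% instead of 70%
# of the available file descriptors.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqFdUsageWarning:
          if: >-
            rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 80
```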
Available starting from the 2019.2.6 maintenance update
| Severity | Critical |
| --- | --- |
| Summary | The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node. |
| Raise condition | `rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95` |
| Description | Raises when the RabbitMQ instance uses 95% of the available file descriptors, indicating that RabbitMQ is about to crash. Warning: For production environments, configure the alert after deployment. |
| Troubleshooting | |
| Tuning | To change the threshold, override the alert expression, as shown in the sketch after this table. |
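And the critical counterpart, under the same assumptions (alert name hypothetical):

```yaml
# Hypothetical override (assumed alert name): fire the critical alert at
# 97% instead of 95% of the available file descriptors.
parameters:
  prometheus:
    server:
      alert:
        RabbitmqFdUsageCritical:
          if: >-
            rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 97
```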