RabbitMQ

This section describes the alerts for the RabbitMQ service.


RabbitmqServiceDown

Severity

Critical

Summary

The RabbitMQ service on the {{$labels.host}} node is down.

Raise condition

rabbitmq_up == 0

Description

Raises when the RabbitMQ service is down on one node, which affects RabbitMQ availability. The alert is raised 1 minute after the issue occurs. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the RabbitMQ status using systemctl status rabbitmq-server.

  • Inspect the RabbitMQ logs in /var/log/rabbitmq.

  • Verify that the node has enough resources, such as disk space or RAM.
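
A minimal sketch of these checks, assuming the standard rabbitmq-server systemd unit, the default /var/log/rabbitmq log directory, and the default /var/lib/rabbitmq data directory:

  # Service state and recent unit messages
  systemctl status rabbitmq-server
  journalctl -u rabbitmq-server --since "1 hour ago"

  # Recent entries in the RabbitMQ log files
  tail -n 100 /var/log/rabbitmq/*.log

  # Free disk space and memory on the node
  df -h /var/lib/rabbitmq
  free -m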

Tuning

Not required

RabbitmqServiceOutage

Severity

Critical

Summary

All RabbitMQ services are down.

Raise condition

count(rabbitmq_up == 0) == count(rabbitmq_up)

Description

Raises when RabbitMQ is down on all nodes, indicating that the service is unavailable. The alert is raised 1 minute after the issue occurs.

Troubleshooting

  • Verify the RabbitMQ status on the msg nodes using systemctl status rabbitmq-server.

  • Inspect the RabbitMQ logs in the /var/log/rabbitmq directory on the msg nodes.

  • Verify that the node has enough resources such as disk space or RAM.
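
For example, you can check the service and cluster state on all RabbitMQ nodes at once from the Salt Master node. This is a sketch that assumes the rabbitmq:server pillar targets the msg nodes, as in the Salt commands used elsewhere in this section:

  salt -C 'I@rabbitmq:server' cmd.run 'systemctl status rabbitmq-server --no-pager'
  salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'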

Tuning

Not required

RabbitMQUnequalQueueCritical

Available starting from the 2019.2.5 maintenance update

Severity

Critical

Summary

The RabbitMQ service has an unequal number of queues across the cluster instances.

Raise condition

max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)

Description

Raises when the RabbitMQ cluster nodes have an inconsistent number of queues for 10 minutes. This issue can occur after a service restart and can make RabbitMQ inaccessible.
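
To see how the queue counts diverge across the instances, you can run the metric from the raise condition in the Prometheus web UI, for example:

  rabbitmq_overview_queues
  max(rabbitmq_overview_queues) - min(rabbitmq_overview_queues)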

Troubleshooting

Contact Mirantis support.

Tuning

Not required

RabbitmqDiskFullWarning

Severity

Warning

Summary

The RabbitMQ service on the {{$labels.host}} node has less than 500 MB of free disk space.

Raise condition

rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10

Description

Raises when the free disk space available to RabbitMQ drops to 500 MB or less (by default, 10 multiplied by the 50 MB free disk limit). RabbitMQ checks the available disk space more frequently as it shrinks, which can affect the system load.

Troubleshooting

Free or add more disk space on the affected node.
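
For example, to see how much space is left and what consumes it on the affected node (a sketch, assuming the default /var/lib/rabbitmq and /var/log/rabbitmq locations):

  df -h /var/lib/rabbitmq
  du -sh /var/lib/rabbitmq/* /var/log/rabbitmq/* 2>/dev/null | sort -h | tail -20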

Tuning

To change the threshold multiplier to 15 (750 MB with the default 50 MB limit):

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqDiskFullWarning:
              if: >-
                rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 15
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqDiskFullCritical

Severity

Critical

Summary

The RabbitMQ disk space on the {{$labels.host}} node is full.

Raise condition

rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit

Description

Raises when RabbitMQ uses all the disk space available to it (less than 50 MB of free space by default). The resulting disk alarm is cluster-wide: RabbitMQ blocks producers and prevents in-memory messages from being paged to disk. Frequent disk checks further contribute to the load growth. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

Add more disk space on the affected node.

Tuning

Not required

RabbitmqMemoryLowWarning

Severity

Warning

Summary

The RabbitMQ service uses more than 80% of memory on the {{$labels.host}} node for 2 minutes.

Raise condition

100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8

Description

Raises when the RabbitMQ memory consumption reaches the warning threshold (80% of allocated memory by default). The host label in the raised alert contains the host name of the affected node.
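
To inspect the current memory usage per node, you can run the expression from the raise condition in the Prometheus web UI:

  100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit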

Troubleshooting

  • Edit the high memory watermark in the service configuration.

  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.

Tuning

To change the watermark:

  1. On the cluster level of the Reclass model, specify the vm_high_watermark parameter in openstack/message_queue.yml. For example:

    rabbitmq:
      server:
        memory:
          vm_high_watermark: 0.8
    
  2. From the Salt Master node, apply the following states one by one:

    salt '*' saltutil.refresh_pillar
    salt -C 'I@rabbitmq:server' state.sls rabbitmq.server
    

RabbitmqMemoryLowCritical

Severity

Critical

Summary

The RabbitMQ service on the {{$labels.host}} node is out of memory.

Raise condition

rabbitmq_node_mem_used >= rabbitmq_node_mem_limit

Description

RabbitMQ uses all the allocated memory and blocks all connections that publish messages to prevent further growth in usage. The host label in the raised alert contains the host name of the affected node. If other system services consume more RAM, the system may start swapping, which can crash the Erlang VM and bring RabbitMQ down on the node.
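
To confirm the broker's own view of its memory usage and the blocked connections on the affected node, you can use rabbitmqctl (a sketch; rabbit@msg01 is an example node name, as in the commands below):

  # Broker status, including the memory breakdown and active resource alarms
  rabbitmqctl -n rabbit@msg01 status

  # Connections blocked or blocking because of the memory alarm
  rabbitmqctl -n rabbit@msg01 list_connections name state | grep -i block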

Troubleshooting

  • Increase paging for RabbitMQ through the CLI. For example:

    rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75
    

    A service restart resets this change.

  • Add more memory on the affected node.

Tuning

Not required

RabbitmqMessagesTooHigh

Severity

Warning

Summary

The RabbitMQ service on the {{$labels.host}} node has received more than 2^20 messages.

Raise condition

rabbitmq_overview_messages > 2^20

Description

Raises when the number of messages received by RabbitMQ exceeds the warning limit (by default, 1024 multiplied by 1024, that is 2^20), typically indicating that some consumers are not picking up messages from the queues.

Troubleshooting

Verify whether large queues are present using rabbitmqctl list_queues.
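
For example, to list the queues with the most messages and their consumer counts (a sketch; rabbit@msg01 is an example node name):

  rabbitmqctl -n rabbit@msg01 list_queues name messages consumers | sort -k2 -n -r | head -20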

Tuning

For example, to change the message count threshold to 2^21:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqMessagesTooHigh:
              if: >-
                rabbitmq_overview_messages > 2^21
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsTooHigh

Severity

Critical (Major in 2019.2.4)

Summary

The rate of errors in RabbitMQ logs is more than 0.2 error messages per second on the {{$labels.host}} node as measured over 5 minutes.

Raise condition

  • In 2019.2.4: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2

  • In 2019.2.5: sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[5m])) without (level) > 0.05

Description

Raises when the average rate of error, fatal, or emergency messages in the RabbitMQ logs on the node is more than 0.2 per second. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity level. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

Inspect the log files in the /var/log/rabbitmq directory on the affected node.

Tuning

Typically, you should not change the default value. If the alert is constantly firing, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust the threshold to an error rate that is acceptable for a particular environment. In the Prometheus web UI, use the raise condition query to view the rate at which a particular message type appears in the logs over a longer period of time and define the best threshold.
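
As a sketch, the following query (the raise condition with a one-hour window instead of five minutes) shows the longer-term error rate and helps to define a realistic threshold:

  sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[1h])) without (level)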

For example, to change the threshold to 0.4:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqErrorLogsTooHigh:
              if: >-
                sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m]))
                without (level) > 0.4
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsMajor

Available starting from the 2019.2.5 maintenance update

Severity

Major

Summary

The RabbitMQ logs on the {{ $labels.host }} node contain errors (as measured over the last 30 minutes).

Raise condition

sum(increase(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[30m])) without (level) > 0

Description

Raises when error or critical messages appear in the RabbitMQ logs on a node. The host label in the raised alert contains the name of the affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number of log messages per severity level.

Troubleshooting

Inspect the log files in the /var/log/rabbitmq directory on the affected node.

Tuning

Not required

RabbitmqFdUsageWarning

Available starting from the 2019.2.6 maintenance update

Severity

Warning

Summary

The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.

Raise condition

rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70

Description

Raises when the RabbitMQ instance uses 70% or more of the available file descriptors.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.

  • Decrease the rpc_workers value and apply the state for the corresponding service, as in the sketch below.
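
A sketch of these steps; the grep path pattern and the nova.controller state are examples only, use the file and state that correspond to the service whose rpc_workers value you change:

  # Look for overridden rpc_workers values in the cluster model
  grep -rn rpc_workers cluster/<cluster_name>/openstack/

  # After lowering the value, re-apply the state of the affected service,
  # for example, for the OpenStack controller nodes running nova:
  salt -C 'I@nova:controller' state.sls nova.controller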

Tuning

For example, to change the threshold to 60%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageWarning:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqFdUsageCritical

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The RabbitMQ service uses {{ $value }}% of all available file descriptors on the {{ $labels.host }} node.

Raise condition

rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95

Description

Raises when the RabbitMQ instance uses 95% or more of the available file descriptors, indicating that RabbitMQ is about to crash.

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Inspect openstack/control.yml in the cluster model to verify whether the default value of the rpc_workers parameter of the OpenStack services was overridden.

  • Decrease the rpc_workers value and apply the state for the corresponding service.
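
To confirm how close the node is to the limit, you can also ask the broker directly (a sketch; rabbit@msg01 is an example node name, and the status output includes a file_descriptors section):

  rabbitmqctl -n rabbit@msg01 status | grep -i -A 4 file_descriptors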

Tuning

For example, to change the threshold to 87%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RabbitmqFdUsageCritical:
              if: >-
                rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 87
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.