RabbitMQ
This section describes the alerts for the RabbitMQ service.

RabbitmqServiceDown

Severity | Critical
Summary | The RabbitMQ service on the {{$labels.host}} node is down.
Raise condition | rabbitmq_up == 0
Description | Raises when the RabbitMQ service is down on a node, which affects
RabbitMQ availability. The alert raises 1 minute after the issue occurs. The
host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Verify the RabbitMQ status using systemctl status rabbitmq-server (see the
  sketch below).
- Inspect the RabbitMQ logs in /var/log/rabbitmq.
- Verify that the node has enough resources, such as disk space or RAM.
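
The following shell sketch combines these checks on the affected node. The log
file name and the data directory mount point are assumptions and may differ in
your environment:

# Check the service state and the most recent RabbitMQ log entries.
systemctl status rabbitmq-server --no-pager
tail -n 100 /var/log/rabbitmq/rabbit@$(hostname -s).log

# Verify that the node has enough free disk space and memory.
df -h /var/lib/rabbitmq
free -m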
Tuning | Not required

RabbitmqServiceOutage

Severity | Critical
Summary | All RabbitMQ services are down.
Raise condition | count(rabbitmq_up == 0) == count(rabbitmq_up)
Description | Raises when RabbitMQ is down on all nodes, indicating that the
service is unavailable. The alert raises 1 minute after the issue occurs.
Troubleshooting |
- Verify the RabbitMQ status on the msg nodes using
  systemctl status rabbitmq-server (see the sketch below).
- Inspect the RabbitMQ logs in the /var/log/rabbitmq directory on the msg
  nodes.
- Verify that the nodes have enough resources, such as disk space or RAM.
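
Because the outage affects all msg nodes, it can be faster to run the checks
from the Salt Master node. A sketch, assuming the I@rabbitmq:server targeting
used elsewhere in this guide:

# Check the service state and the cluster status on every RabbitMQ node.
salt -C 'I@rabbitmq:server' cmd.run 'systemctl is-active rabbitmq-server'
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'

# Check free disk space and memory on the msg nodes.
salt -C 'I@rabbitmq:server' cmd.run 'df -h /var/lib/rabbitmq; free -m'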
Tuning | Not required

RabbitMQUnequalQueueCritical

Available starting from the 2019.2.5 maintenance update

Severity | Critical
Summary | The RabbitMQ service has an unequal number of queues across the
cluster instances.
Raise condition | max(rabbitmq_overview_queues) != min(rabbitmq_overview_queues)
Description | Raises when the RabbitMQ cluster nodes have an inconsistent
number of queues for 10 minutes. This issue can occur after a service restart
and can make RabbitMQ inaccessible.
Troubleshooting | Contact Mirantis support. (A quick way to compare queue
counts per node is sketched below.)
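
A minimal sketch for comparing the queue counts that each node reports,
assuming you run it from the Salt Master node with the I@rabbitmq:server
targeting used elsewhere in this guide:

# Count the queues that each RabbitMQ node reports; the numbers should match
# across the cluster (-q suppresses the rabbitmqctl header line).
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl -q list_queues name | wc -l'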
Tuning | Not required

RabbitmqDiskFullWarning

Severity | Warning
Summary | The RabbitMQ service on the {{$labels.host}} node has less than
500 MB of free disk space.
Raise condition | rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 10
Description | Raises when the disk space available to RabbitMQ drops to ten
times the free disk limit or less (by default, 10 multiplied by 50 MB, that is,
500 MB). RabbitMQ checks the available disk space more frequently as it
shrinks, which can affect system load.
Troubleshooting | Free or add more disk space on the affected node. A sketch
for inspecting the current values is shown below.
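
A minimal sketch for checking the current values on the affected node. The
exact rabbitmqctl status output format depends on the RabbitMQ version, so
treat the grep pattern as an assumption:

# Show the free disk space and the configured free disk limit as RabbitMQ
# sees them.
rabbitmqctl status | grep -i disk_free

# Show file system usage for the RabbitMQ data directory.
df -h /var/lib/rabbitmq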
Tuning |
To change the threshold to 15:

1. On the cluster level of the Reclass model, create a common file for all
   alert customizations. Skip this step to use an existing defined file.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold by
   overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           RabbitmqDiskFullWarning:
             if: >-
               rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit * 15

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI (or through the
   API, as sketched below).
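
If you prefer the command line to the web UI, a sketch for the last
verification step, assuming Prometheus listens on <prometheus_host>:9090 and
that your Prometheus version exposes the /api/v1/rules endpoint (adjust to
your deployment):

# Query the Prometheus rules API and show the RabbitmqDiskFullWarning
# definition, including the updated expression.
curl -s 'http://<prometheus_host>:9090/api/v1/rules' | \
  python -m json.tool | grep -A 10 'RabbitmqDiskFullWarning'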

RabbitmqDiskFullCritical

Severity | Critical
Summary | The RabbitMQ disk space on the {{$labels.host}} node is full.
Raise condition | rabbitmq_node_disk_free <= rabbitmq_node_disk_free_limit
Description | Raises when RabbitMQ uses all the available disk space (less than
50 MB left by default). The alert is cluster-wide. RabbitMQ blocks producers
and prevents in-memory messages from being paged to disk. Frequent disk checks
contribute to load growth. The host label in the raised alert contains the host
name of the affected node.
Troubleshooting | Add more disk space on the affected node.
Tuning | Not required

RabbitmqMemoryLowWarning

Severity | Warning
Summary | The RabbitMQ service uses more than 80% of memory on the
{{$labels.host}} node for 2 minutes.
Raise condition | 100 * rabbitmq_node_mem_used / rabbitmq_node_mem_limit >= 100 * 0.8
Description | Raises when the RabbitMQ memory consumption reaches the warning
threshold (80% of the allocated memory by default). The host label in the
raised alert contains the host name of the affected node.
Troubleshooting |
- Edit the high memory watermark in the service configuration.
- Increase paging for RabbitMQ through the CLI. For example:

  rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75

  A service restart resets this change.
- Add more memory to the affected node.

A sketch for inspecting the current memory usage is shown after this list.
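
A minimal sketch for checking how much memory RabbitMQ currently uses relative
to its high watermark. The exact rabbitmqctl status output format depends on
the RabbitMQ version, so treat the grep pattern as an assumption:

# Show the RabbitMQ memory details, including the high watermark and the
# computed memory limit.
rabbitmqctl status | grep -i memory

# Compare with the overall memory usage on the node.
free -m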
Tuning |
To change the watermark:

1. On the cluster level of the Reclass model, specify the vm_high_watermark
   parameter in openstack/message_queue.yml. For example:

   rabbitmq:
     server:
       memory:
         vm_high_watermark: 0.8

2. From the Salt Master node, apply the following states one by one:

   salt '*' saltutil.refresh_pillar
   salt -C 'I@rabbitmq:server' state.sls rabbitmq.server

RabbitmqMemoryLowCritical

Severity | Critical
Summary | The RabbitMQ service on the {{$labels.host}} node is out of memory.
Raise condition | rabbitmq_node_mem_used >= rabbitmq_node_mem_limit
Description | Raises when RabbitMQ uses all the allocated memory and blocks all
connections that are publishing messages to prevent further usage growth. The
host label in the raised alert contains the host name of the affected node. If
other system services consume more RAM, the system may start to swap, which can
crash the Erlang VM and bring RabbitMQ down on the node.
Troubleshooting |
- Increase paging for RabbitMQ through the CLI. For example:

  rabbitmqctl -n rabbit@msg01 set_global_parameter \
    vm_memory_high_watermark_paging_ratio 0.75

  A service restart resets this change.
- Add more memory to the affected node.
Tuning | Not required

RabbitmqMessagesTooHigh

Severity | Warning
Summary | The RabbitMQ service on the {{$labels.host}} node has received more
than 2^20 messages.
Raise condition | rabbitmq_overview_messages > 2^20
Description | Raises when the number of messages received by RabbitMQ exceeds
the warning limit (by default, 1024 multiplied by 1024), typically indicating
that some consumers do not pick up messages from the queues.
Troubleshooting | Verify whether large queues are present using
rabbitmqctl list_queues, as sketched below.
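
A minimal sketch for finding the largest queues on the affected node. The
column selection and the sorting are assumptions about what is most useful to
look at first:

# List queues with their message and consumer counts, and show the queues
# holding the most messages (-q suppresses the rabbitmqctl header line).
rabbitmqctl -q list_queues name messages consumers | sort -k 2 -n -r | head -n 20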
Tuning |
For example, to change the threshold to 2^21:

1. On the cluster level of the Reclass model, create a common file for all
   alert customizations. Skip this step to use an existing defined file.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold by
   overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           RabbitmqMessagesTooHigh:
             if: >-
               rabbitmq_overview_messages > 2^21

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsTooHigh

Severity | Critical (Major in 2019.2.4)
Summary | The rate of errors in RabbitMQ logs is more than 0.2 error messages
per second on the {{$labels.host}} node, as measured over 5 minutes.
Raise condition |
In 2019.2.4:
sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.2

In 2019.2.5:
sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[5m])) without (level) > 0.05

Description | Raises when the average per-second rate of error, fatal, or
emergency messages in the RabbitMQ logs on the node exceeds 0.2 per second.
Fluentd forwards all logs from RabbitMQ to Elasticsearch and counts the number
of log messages per severity. The host label in the raised alert contains the
host name of the affected node.
Troubleshooting | Inspect the log files in the /var/log/rabbitmq directory on
the affected node.
Tuning |
Typically, you should not change the default value. If the alert is constantly
firing, inspect the RabbitMQ logs in the Kibana web UI. However, you can adjust
the threshold to an acceptable error rate for a particular environment. In the
Prometheus web UI, use the raise condition query to view the rate at which a
particular message type appears in the logs over a longer period of time and
define the best threshold, as sketched below.
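
A sketch of such a query through the Prometheus HTTP API, assuming Prometheus
listens on <prometheus_host>:9090 (adjust the address and the level filter to
your environment and maintenance update):

# Evaluate the error-log rate averaged over the last hour instead of 5 minutes
# to get a feel for the baseline error rate.
curl -sG 'http://<prometheus_host>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(log_messages{service="rabbitmq",level=~"(?i:(error|emergency|fatal))"}[1h])) without (level)'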
For example, to change the threshold to 0.4:

1. On the cluster level of the Reclass model, create a common file for all
   alert customizations. Skip this step to use an existing defined file.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold by
   overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           RabbitmqErrorLogsTooHigh:
             if: >-
               sum(rate(log_messages{service="rabbitmq", level=~"(?i:(error|emergency|fatal))"}[5m])) without (level) > 0.4

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqErrorLogsMajor

Available starting from the 2019.2.5 maintenance update

Severity | Major
Summary | The RabbitMQ logs on the {{ $labels.host }} node contain errors (as
measured over the last 30 minutes).
Raise condition | sum(increase(log_messages{service="rabbitmq",level=~"(?i:(error|critical))"}[30m])) without (level) > 0
Description | Raises when error or critical messages appear in the RabbitMQ
logs on a node. The host label in the raised alert contains the name of the
affected node. Fluentd forwards all logs from RabbitMQ to Elasticsearch and
counts the number of log messages per severity.
Troubleshooting | Inspect the log files in the /var/log/rabbitmq directory on
the affected node.
Tuning | Not required

RabbitmqFdUsageWarning

Available starting from the 2019.2.6 maintenance update

Severity | Warning
Summary | The RabbitMQ service uses {{ $value }}% of all available file
descriptors on the {{ $labels.host }} node.
Raise condition | rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 70
Description | Raises when the RabbitMQ instance uses 70% or more of the
available file descriptors.

Warning: For production environments, configure the alert after deployment.

Troubleshooting |
- Inspect openstack/control.yml in the cluster model to verify whether the
  default value of the OpenStack rpc_workers parameter was overwritten.
- Decrease the rpc_workers value and apply the state for the corresponding
  service.

A sketch for inspecting the current file descriptor usage is shown below.
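
A minimal sketch for inspecting the file descriptor usage on the affected node.
The rabbitmqctl status output format and the Erlang VM process name (beam.smp)
are assumptions that depend on the installed versions:

# Show the file descriptor usage as reported by RabbitMQ.
rabbitmqctl status | grep -A 4 file_descriptors

# Show the open-files limit of the Erlang VM process running RabbitMQ.
cat /proc/$(pgrep -o -x beam.smp)/limits | grep -i 'open files'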
Tuning |
For example, to change the threshold to 60%:

1. On the cluster level of the Reclass model, create a common file for all
   alert customizations. Skip this step to use an existing defined file.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold by
   overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           RabbitmqFdUsageWarning:
             if: >-
               rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 60

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.

RabbitmqFdUsageCritical

Available starting from the 2019.2.6 maintenance update

Severity | Critical
Summary | The RabbitMQ service uses {{ $value }}% of all available file
descriptors on the {{ $labels.host }} node.
Raise condition | rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 95
Description | Raises when the RabbitMQ instance uses 95% or more of the
available file descriptors, indicating that RabbitMQ is about to crash.

Warning: For production environments, configure the alert after deployment.

Troubleshooting |
- Inspect openstack/control.yml in the cluster model to verify whether the
  default value of the OpenStack rpc_workers parameter was overwritten.
- Decrease the rpc_workers value and apply the state for the corresponding
  service.
Tuning |
For example, to change the threshold to 87%:

1. On the cluster level of the Reclass model, create a common file for all
   alert customizations. Skip this step to use an existing defined file.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold by
   overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           RabbitmqFdUsageCritical:
             if: >-
               rabbitmq_node_fd_used / rabbitmq_node_fd_total * 100 >= 87

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.