Caution
The following procedure has been tested with a clustered RabbitMQ of the 3.8 series. For the RabbitMQ 3.6 series, as well as for a nonclustered RabbitMQ, you may need to adjust the procedure.
Note
This feature is available starting from the MCP 2019.2.15 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.
In case of network hiccups or instability, a RabbitMQ cluster may fail to pass messages between some exchanges and their respective queues even though the corresponding bindings are present and visible in the RabbitMQ Management Plugin UI. Because self-healing actions and OpenStack clients recreating queues during reconnection are handled concurrently, some of the bindings that route messages from exchanges to queues may be created in a dysfunctional state. The same issue applies to queue policies: for example, queues can be created without the required policies, which can lead to unexpected behavior.
This section describes how to identify and fix dysfunctional bindings. The solution includes a helper script that sends test messages to exchanges with the identified bindings. If a message cannot be delivered with a particular binding, the script prints out a message about a potentially dysfunctional binding.
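The check that the script performs can also be reproduced manually through the RabbitMQ Management Plugin HTTP API, which reports whether a test message published to an exchange was routed to any queue. The following curl call is a minimal sketch only: %EXCHANGE% and %ROUTING_KEY% are placeholders for the exchange and routing key of the binding you want to test, and %IP% and %PASSWORD% are the parameters obtained later in this procedure:
curl -s -u admin:%PASSWORD% -H 'Content-Type: application/json' -X POST "http://%IP%:15672/api/exchanges/%2Fopenstack/%EXCHANGE%/publish" -d '{"properties":{},"routing_key":"%ROUTING_KEY%","payload":"test","payload_encoding":"string"}'
A response of {"routed":false} means that no queue received the test message, which is the condition the script reports as a possibly dysfunctional binding.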
Warning
Perform the following procedure only if the RabbitMQ cluster is healthy, the self-healing is over, and cluster_status indicates that the cluster is running all the nodes it was configured with. To verify the RabbitMQ cluster health, run the following command:
rabbitmqctl cluster_status
If the output indicates that the cluster is running fewer nodes than it was configured with and shows active partitions that will not be dismissed, or if the command errors out showing that the RabbitMQ process is not running, proceed with Restart RabbitMQ with clearing the Mnesia database instead.
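If you are logged in to the Salt Master node rather than a msg node, you can run the same health check on all RabbitMQ nodes at once, as is also done later in this procedure:
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'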
To identify and fix RabbitMQ dysfunctional queues:
Log in to the msg01 node.
Download the script:
wget "https://review.fuel-infra.org/gitweb?p=tools/sustaining.git;a=blob_plain;f=scripts/rabbitmq_check.py;hb=HEAD" -O ./rabbitmq_check.py
Obtain the RabbitMQ admin credentials:
salt-call -l quiet pillar.items rabbitmq:server:admin
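If you only need the password value, a pillar.get call prints it directly. This assumes that the admin password is stored under the password key of the rabbitmq:server:admin pillar, which is the usual layout:
salt-call -l quiet pillar.get rabbitmq:server:admin:password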
Obtain the IP address where the RabbitMQ Management Plugin is listening for incoming connections:
ss -lnt '( sport = :15672 )'
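The Local Address:Port column of the output contains the IP address to pass as %IP% in the next step. The output below is illustrative only; the actual address depends on your deployment:
State      Recv-Q     Send-Q     Local Address:Port     Peer Address:Port
LISTEN     0          128        192.168.2.11:15672     *:*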
Run the script, replacing %IP% and %PASSWORD% with the previously obtained parameters:
python3 ./rabbitmq_check.py -u admin -p %PASSWORD% -H %IP% -V "/openstack" check
If the script returns output with Possibly unreachable binding […] …, discrepancies have been found. Proceed with the steps below.
Restart the service that runs the particular exchange-queue pair. For example, if a conductor binding is faulty, restart the nova-conductor service on all OpenStack controller nodes.
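One way to do this, assuming the OpenStack controller nodes match the ctl* Salt target and the service is named nova-conductor on these nodes (adjust both to your deployment), is to run the restart from the Salt Master node:
salt 'ctl*' service.restart nova-conductor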
If the service restart did not fix the issue, drop the state of exchanges and queues:
Log in to the Salt Master node.
Disconnect the clients from RabbitMQ so that they do not interfere with the cluster operations:
salt -C 'I@rabbitmq:server' cmd.run 'iptables -I INPUT -p tcp --dport 5671:5672 -j REJECT'
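Optionally, verify that the clients have been disconnected before proceeding. Client connections may take some time to drop after the iptables rule is applied; once they are all gone, rabbitmqctl list_connections shows no entries:
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl list_connections'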
Stop the RabbitMQ application without shutting down the entire Erlang machine:
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl stop_app'
Wait for 1-2 minutes and then start the RabbitMQ application:
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl start_app'
Verify that the RabbitMQ cluster is healthy:
salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
Remove the iptables rules to allow clients to connect back to RabbitMQ:
salt -C 'I@rabbitmq:server' cmd.run 'iptables -D INPUT -p tcp --dport 5671:5672 -j REJECT'
If none of the approaches fixes the issue, proceed with Restart RabbitMQ with clearing the Mnesia database.