Troubleshoot RabbitMQ non-functional queue bindings

Caution

The following procedure has been tested with clustered RabbitMQ series 3.8. For RabbitMQ series 3.6 or a nonclustered RabbitMQ deployment, you may need to adjust the procedure.

Note

This feature is available starting from the MCP 2019.2.15 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

In case of network hiccups or instability, a RabbitMQ cluster may fail to pass messages between some exchanges and their respective queues even though the corresponding bindings are present and visible in the RabbitMQ Management Plugin UI. When self-healing actions run concurrently with OpenStack clients recreating queues during reconnection, some bindings that route messages from exchanges to queues may be created in a dysfunctional state. The same issue applies to queue policies: for example, queues can be created without the required policies, which can lead to unexpected behavior.

This section describes how to identify and fix dysfunctional bindings. The solution includes a helper script that sends test messages to exchanges with the identified bindings. If a message cannot be delivered with a particular binding, the script prints out a message about a potentially dysfunctional binding.
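The idea behind such a check can be sketched with the RabbitMQ Management HTTP API, which lists bindings and reports whether a published message was routed. This is a minimal illustration, not the source of rabbitmq_check.py: the function names are hypothetical, plain HTTP on port 15672 is assumed, and the reply only says whether the message reached at least one queue, so it cannot single out one broken binding among several that share a routing key. Like the helper script, it delivers test messages to live queues.

```python
import base64
import json
import urllib.parse
import urllib.request


def api_url(host, path_parts, port=15672):
    """Build a Management API URL, percent-encoding every path segment."""
    quoted = "/".join(urllib.parse.quote(p, safe="") for p in path_parts)
    return f"http://{host}:{port}/api/{quoted}"


def publish_body(routing_key, payload="binding-check"):
    """JSON body for POST /api/exchanges/{vhost}/{exchange}/publish."""
    return json.dumps({
        "properties": {},
        "routing_key": routing_key,
        "payload": payload,
        "payload_encoding": "string",
    }).encode()


def is_routed(response_text):
    """The publish endpoint replies with {"routed": true} or {"routed": false}."""
    return json.loads(response_text).get("routed") is True


def call(url, user, password, data=None):
    """Send an authenticated request; passing a body turns it into a POST."""
    req = urllib.request.Request(url, data=data)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode()


def check_bindings(host, user, password, vhost="/openstack"):
    """Yield (exchange, queue, routing_key) for bindings that did not route."""
    listing = call(api_url(host, ["bindings", vhost]), user, password)
    for b in json.loads(listing):
        # Skip default-exchange bindings and exchange-to-exchange bindings.
        if not b["source"] or b["destination_type"] != "queue":
            continue
        url = api_url(host, ["exchanges", vhost, b["source"], "publish"])
        reply = call(url, user, password, publish_body(b["routing_key"]))
        if not is_routed(reply):
            yield b["source"], b["destination"], b["routing_key"]
```

Calling list(check_bindings(ip, "admin", password)) on a node that can reach the Management Plugin would return the suspicious bindings. An empty list means that every test message was routed to at least one queue.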

Warning

Perform the following procedure only if the RabbitMQ cluster is healthy, the self-healing is over, and cluster_status indicates that the cluster is running all the nodes it was configured with. To verify the RabbitMQ cluster health, run the following command:

rabbitmqctl cluster_status

If the output indicates that the cluster is running fewer nodes than it was configured with and shows active partitions that will not be dismissed, or if the command fails with an error showing that the RabbitMQ process is not running, proceed with Restart RabbitMQ with clearing the Mnesia database instead.
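The health criteria above can also be checked programmatically. The following sketch assumes that rabbitmqctl in the 3.8 series accepts --formatter json for cluster_status and that the resulting document contains the disk_nodes, ram_nodes, running_nodes, and partitions keys. Verify both assumptions against the output on your nodes before relying on it. The function names are hypothetical.

```python
import json
import subprocess


def cluster_problems(status):
    """Return a list of human-readable problems found in a parsed status."""
    configured = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    problems = []
    missing = sorted(configured - running)
    if missing:
        problems.append("not running: " + ", ".join(missing))
    # An empty list or map means that no partitions are reported.
    if status.get("partitions"):
        problems.append("active partitions: " + json.dumps(status["partitions"]))
    return problems


def read_cluster_status():
    """Run rabbitmqctl locally and parse its JSON output (assumed format)."""
    # check=True raises CalledProcessError if the command exits with an
    # error, for example when the RabbitMQ process is not running.
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

An empty list returned by cluster_problems(read_cluster_status()) corresponds to the healthy state required by this procedure. Any entry in the list, or an exception from read_cluster_status(), points to the Mnesia-clearing procedure instead.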

To identify and fix dysfunctional RabbitMQ queue bindings:

  1. Log in to the msg01 node.

  2. Download the script:

    wget "https://review.fuel-infra.org/gitweb?p=tools/sustaining.git;a=blob_plain;f=scripts/rabbitmq_check.py;hb=HEAD" -O ./rabbitmq_check.py
    
  3. Obtain the RabbitMQ admin credentials:

    salt-call -l quiet pillar.items rabbitmq:server:admin
    
  4. Obtain the IP address where the RabbitMQ Management Plugin is listening for incoming connections:

    ss -lnt '( sport = :15672 )'
    
  5. Run the script, replacing %IP% and %PASSWORD% with the values obtained in the previous steps:

    python3 ./rabbitmq_check.py -u admin -p %PASSWORD% -H %IP% -V "/openstack" check
    

    If the script returns output with Possibly unreachable binding […] …, discrepancies have been found. Proceed with the steps below.

  6. Restart the service that runs the particular exchange-queue pair. For example, if a conductor binding is faulty, restart the nova-conductor service on all OpenStack controller nodes.

  7. If the service restart did not fix the issue, drop the state of exchanges and queues:

    1. Log in to the Salt Master node.

    2. Disconnect the clients from RabbitMQ so that they do not interfere with intra-cluster operations:

      salt -C 'I@rabbitmq:server' cmd.run 'iptables -I INPUT -p tcp --dport 5671:5672 -j REJECT'
      
    3. Stop the RabbitMQ application without shutting down the entire Erlang machine:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl stop_app'
      
    4. Wait for 1-2 minutes and then start the RabbitMQ application:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl start_app'
      
    5. Verify that the RabbitMQ cluster is healthy:

      salt -C 'I@rabbitmq:server' cmd.run 'rabbitmqctl cluster_status'
      
    6. Remove the iptables rules to allow clients to connect back to RabbitMQ:

      salt -C 'I@rabbitmq:server' cmd.run 'iptables -D INPUT -p tcp --dport 5671:5672 -j REJECT'
      

If none of the approaches fixes the issue, proceed with Restart RabbitMQ with clearing the Mnesia database.
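The state-dropping substeps of step 7 can be wrapped in a small script so that the iptables rule is always removed, even if one of the intermediate commands fails. This is a minimal sketch meant to run on the Salt Master node. The function names and the 120-second default wait are illustrative choices, and lifting the block after a failure is a design decision of this sketch, not part of the documented procedure. Note that the exit code of salt itself may not reflect a failure of the remote command, so still read the printed cluster_status output.

```python
import subprocess
import time

TARGET = "I@rabbitmq:server"
BLOCK = "iptables -I INPUT -p tcp --dport 5671:5672 -j REJECT"
UNBLOCK = "iptables -D INPUT -p tcp --dport 5671:5672 -j REJECT"


def salt(command):
    """Run a shell command on all RabbitMQ nodes through Salt."""
    subprocess.run(["salt", "-C", TARGET, "cmd.run", command], check=True)


def reset_rabbitmq_app(run=salt, sleep=time.sleep, wait_seconds=120):
    """Stop and start the RabbitMQ application while clients are blocked."""
    run(BLOCK)
    try:
        run("rabbitmqctl stop_app")
        sleep(wait_seconds)  # the procedure asks for 1-2 minutes
        run("rabbitmqctl start_app")
        run("rabbitmqctl cluster_status")
    finally:
        # Always lift the firewall rule, even if a step above raised.
        run(UNBLOCK)


if __name__ == "__main__":
    reset_rabbitmq_app()
```

The run and sleep parameters exist only to make a dry run possible: reset_rabbitmq_app(run=print, sleep=lambda s: None) prints the commands in order without executing anything.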