Switch to nonclustered RabbitMQ

This section describes how to switch clustered RabbitMQ to a nonclustered configuration.

Note

This feature is available starting from the MCP 2019.2.13 maintenance update. Before using the feature, follow the steps described in Apply maintenance updates.

Caution

  • This feature is available for OpenStack Queens and requires RabbitMQ version 3.8.2 or newer.
  • The procedure below applies only to environments without manual changes in the configuration files of OpenStack services. The procedure applies all OpenStack states to all OpenStack nodes and assumes that every state can be applied without errors before you start the maintenance.

To switch RabbitMQ to a nonclustered configuration:

  1. Perform the following prerequisite steps:

    1. Log in to the Salt Master node.

    2. Verify that the salt-formula-nova version is 2016.12.1+202101271624.d392d41~xenial1 or newer:

      dpkg -l |grep salt-formula-nova
      
    3. Verify that the salt-formula-oslo-templates version is 2018.1+202101191343.e24fd64~xenial1 or newer:

      dpkg -l |grep salt-formula-oslo-templates
      
    4. Create /root/non-clustered-rabbit-helpers.sh with the following content:

      #!/bin/bash
      
      # Apply all known openstack states on given target
      # example: run_openstack_states ctl*
      function run_openstack_states {
        local target="$1"
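        # Build a space-separated list of all OpenStack formulas defined in
        # orchestration:upgrade:applications, ordered by priority (jq sorts the
        # formula names by their priority value; sed strips the JSON quotes and brackets)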
        all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
        #List of nodes in cloud
        list_nodes=`salt -C "$target" test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
        for node in $list_nodes; do
          #List of applications on the given node
          node_applications=$(salt $node pillar.items __reclass__:applications --out=json | jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
          for component in $all_formulas ; do
            if [[ " ${node_applications[*]} " == *"$component"* ]]; then
              echo "Applying state: $component on the $node"
              salt $node state.apply $component
            fi
          done
        done
      }
      
      # Apply specified update state for all OpenStack applications on given target
      # example: run_openstack_update_states ctl0* upgrade.verify
      # will run {nova|glance|cinder|keystone}.upgrade.verify on ctl01
      function run_openstack_update_states {
        local target="$1"
        local state="$2"
        all_formulas=$(salt-call config.get orchestration:upgrade:applications --out=json | jq '.[] | . as $in | keys_unsorted | map ({"key": ., "priority": $in[.].priority}) | sort_by(.priority) | map(.key | [(.)]) | add' | sed -e 's/"//g' -e 's/,//g' -e 's/\[//g' -e 's/\]//g')
        #List of nodes in cloud
        list_nodes=`salt -C "$target" test.ping --out=text | cut -d: -f1 | tr '\n' ' '`
        for node in $list_nodes; do
          #List of applications on the given node
          node_applications=$(salt $node pillar.items __reclass__:applications --out=json | jq 'values |.[] | values |.[] | .[]' | tr -d '"' | tr '\n' ' ')
          for component in $all_formulas ; do
            if [[ " ${node_applications[*]} " == *"$component"* ]]; then
              echo "Applying state: $component.${state} on the $node"
              salt $node state.apply $component.${state}
            fi
          done
        done
      }
      
    5. Run simple API checks for ctl01*. The output should not include errors.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl01* upgrade.verify
      
    6. Open your project Git repository with the Reclass model on the cluster level.

    7. Prepare the Neutron server for the RabbitMQ reconfiguration:

      1. In openstack/control.yml, specify the allow_automatic_dhcp_failover parameter as required.

        Caution

        If set to true, the server reschedules networks from failed DHCP agents so that the remaining alive agents pick those networks up and continue serving DHCP. Once a failed agent reconnects to RabbitMQ, it detects that its networks have been rescheduled and removes the corresponding DHCP ports, namespaces, and flows. This behavior is useful if an entire gateway node goes down. However, if only RabbitMQ is unstable, the agents do not actually go down and the data plane is not affected. Therefore, we recommend setting the allow_automatic_dhcp_failover parameter to false. Before doing so, consider the risk of a gateway node going down.

        neutron:
          server:
            allow_automatic_dhcp_failover: false
        
      2. Apply the changes:

        salt -C 'I@neutron:server' state.apply neutron.server
        
      3. Verify the changes:

        salt -C 'I@neutron:server' cmd.run "grep allow_automatic_dhcp_failover /etc/neutron/neutron.conf"
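
        If the change has been applied, the output for each Neutron server node should include a line similar to the following hypothetical sketch, with the value that you set in the previous step:

        allow_automatic_dhcp_failover = false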
        
  2. Perform the following changes in the Reclass model on the cluster level:

    1. In infra/init.yml, add the following variable:

      parameters:
        _param:
          openstack_rabbitmq_standalone_mode: true
      
    2. In openstack/message_queue.yml, comment out the following class:

      classes:
      #- system.rabbitmq.server.cluster
      
    3. In openstack/message_queue.yml, add the following classes:

      classes:
      - system.keepalived.cluster.instance.rabbitmq_vip
      - system.rabbitmq.server.single
      
    4. If your deployment has OpenContrail, override the RabbitMQ members lists with the VIP address by adding the following variables (the ~ prefix makes Reclass replace the existing list instead of merging with it):

      1. In opencontrail/analytics.yml, add:

        parameters:
          opencontrail:
            collector:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
        
      2. In opencontrail/control.yml, add:

        parameters:
          opencontrail:
            config:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
            control:
              message_queue:
                ~members:
                  - host: ${_param:openstack_message_queue_address}
        
    5. To update the cells database when running Nova states, add the following variable to openstack/control.yml:

      parameters:
        nova:
          controller:
            update_cells: true
      
    6. Refresh pillars on all nodes:

      salt '*' saltutil.sync_all; salt '*' saltutil.refresh_pillar
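
      As an optional sanity check (a hypothetical example assuming the default MCP pillar layout), confirm that the new parameter is visible in the pillar data:

      salt 'ctl01*' pillar.get _param:openstack_rabbitmq_standalone_mode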
      
  3. Verify that the messaging variables are set correctly:

    Note

    The following validation covers the core OpenStack services only. Validate any additionally deployed services accordingly. An illustrative example of the expected values is provided after the list below.

    1. For Keystone:

      salt -C 'I@keystone:server' pillar.items keystone:server:message_queue:use_vip_address keystone:server:message_queue:host
      
    2. For Heat:

      salt -C 'I@heat:server' pillar.items heat:server:message_queue:use_vip_address heat:server:message_queue:host
      
    3. For Cinder:

      salt -C 'I@cinder:controller' pillar.items cinder:controller:message_queue:use_vip_address cinder:controller:message_queue:host
      
    4. For Glance:

      salt -C 'I@glance:server' pillar.items glance:server:message_queue:use_vip_address glance:server:message_queue:host
      
    5. For Nova:

      salt -C 'I@nova:controller' pillar.items nova:controller:message_queue:use_vip_address nova:controller:message_queue:host
      
    6. For the OpenStack compute nodes:

      salt -C 'I@nova:compute' pillar.items nova:compute:message_queue:use_vip_address nova:compute:message_queue:host
      
    7. For Neutron:

      salt -C 'I@neutron:server' pillar.items neutron:server:message_queue:use_vip_address neutron:server:message_queue:host
      salt -C 'I@neutron:gateway' pillar.items neutron:gateway:message_queue:use_vip_address neutron:gateway:message_queue:host
      
    8. If your deployment has OpenContrail:

      salt 'ntw01*' pillar.items opencontrail:config:message_queue:members opencontrail:control:message_queue:members
      salt 'nal01*' pillar.items opencontrail:collector:message_queue:members
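
    As a hypothetical example for the Keystone check, assuming that the cluster-level changes above have been applied and that the VIP address defined by openstack_message_queue_address is 192.168.2.10, the returned values are expected to resemble the following, with use_vip_address set to True and host containing the VIP address:

      keystone:server:message_queue:use_vip_address: True
      keystone:server:message_queue:host: 192.168.2.10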
      
  4. Apply the changes:

    1. Stop the OpenStack control plane services on the ctl nodes:

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl* upgrade.service_stopped
      
    2. Stop the OpenStack services on the gtw nodes. Skip this step if your deployment has OpenContrail or does not have gtw nodes.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states gtw* upgrade.service_stopped
      
    3. Reconfigure the Keepalived and RabbitMQ clusters on the msg nodes:

      1. Verify that the rabbitmq:cluster pillars are not present:

        salt -C 'I@rabbitmq:server' pillar.items rabbitmq:cluster
        
      2. Verify that the haproxy pillars are not present:

        salt -C 'I@rabbitmq:server' pillar.item haproxy
        
      3. Remove HAProxy, HAProxy monitoring, and reconfigure Keepalived:

        salt -C 'I@rabbitmq:server' cmd.run "export DEBIAN_FRONTEND=noninteractive; apt purge haproxy -y"
        salt -C 'I@rabbitmq:server' state.apply telegraf
        salt -C 'I@rabbitmq:server' state.apply keepalived
        
      4. Verify that a VIP address is present on one of the msg nodes:

        OPENSTACK_MSG_Q_ADDRESS=$(salt msg01* pillar.items _param:openstack_message_queue_address --out json|jq '.[][]')
        salt -C 'I@rabbitmq:server' cmd.run "ip addr |grep $OPENSTACK_MSG_Q_ADDRESS"
        
      5. Stop the RabbitMQ server, clear mnesia, and reconfigure rabbitmq-server:

        salt -C 'I@rabbitmq:server' cmd.run 'systemctl stop rabbitmq-server'
        salt -C 'I@rabbitmq:server' cmd.run 'rm -rf /var/lib/rabbitmq/mnesia/'
        salt -C 'I@rabbitmq:server' state.apply rabbitmq
        
      6. Verify that the RabbitMQ server is running in a nonclustered configuration. Each msg node should report only itself in running_nodes:

        salt -C 'I@rabbitmq:server' cmd.run "rabbitmqctl --formatter=erlang cluster_status |grep running_nodes"
        

        Example of system response:

        msg01.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg01]},
        msg03.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg03]},
        msg02.heat-cicd-queens-dvr-sl.local:
             {running_nodes,[rabbit@msg02]},
        
    4. Reconfigure OpenStack services on the ctl nodes:

      1. Apply all OpenStack states on ctl nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states ctl*
        
      2. Verify transport_url for the OpenStack services on the ctl nodes. An illustrative sketch of the expected format is provided at the end of this step:

        salt 'ctl*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify that the cells database is updated and transport_url has a VIP address:

        salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; nova-manage cell_v2 list_cells"
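
        In a nonclustered configuration, transport_url is expected to reference the single VIP address instead of listing all msg nodes as in a clustered configuration. The following line is a hypothetical sketch of a matching entry; the credentials, VIP address, and virtual host are placeholders:

        transport_url = rabbit://openstack:<password>@192.168.2.10:5672/<virtual_host>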
        
    5. Reconfigure RabbitMQ on the gtw nodes. Skip this step if your deployment has OpenContrail or does not have gtw nodes.

      1. Apply all OpenStack states on the gtw nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states gtw*
        
      2. Verify transport_url for the OpenStack services on the gtw nodes:

        salt 'gtw*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify that the Neutron agents on the gtw nodes are up:

        salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack network agent list"
        
    6. If your deployment has OpenContrail, reconfigure RabbitMQ on the ntw and nal nodes:

      1. Apply the following state on the ntw and nal nodes:

        salt -C 'ntw* or nal*' state.apply opencontrail
        
      2. Verify the RabbitMQ server configuration for the OpenContrail services on the ntw and nal nodes:

        salt -C 'ntw* or nal*' cmd.run "for s in contrail; do if [[ -d "/etc/\$s" ]]; then grep ^rabbitmq_server_list /etc/\$s/*.conf; fi; done" shell=/bin/bash
        salt 'ntw*' cmd.run "for s in contrail; do if [[ -d "/etc/\$s" ]]; then grep ^rabbit_server /etc/\$s/*.conf; fi; done" shell=/bin/bash
        
      3. Verify the OpenContrail status:

        salt -C 'ntw* or nal*' cmd.run 'doctrail all contrail-status'
        
    7. Reconfigure OpenStack services on the cmp nodes:

      1. Apply all OpenStack states on the cmp nodes:

        . /root/non-clustered-rabbit-helpers.sh
        run_openstack_states cmp*
        
      2. Verify transport_url for the OpenStack services on the cmp nodes:

        salt 'cmp*' cmd.run "for s in nova glance cinder keystone heat neutron; do if [[ -d "/etc/\$s" ]]; then grep ^transport_url /etc/\$s/*.conf; fi; done" shell=/bin/bash
        

    Caution

    If your deployment has other nodes with OpenStack services, apply the changes on such nodes as well using the required states.

  5. Verify the services:

    1. Verify that the Neutron services are up. Skip this step if your deployment has OpenContrail.

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack network agent list"
      
    2. Verify that the Nova services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack compute service list"
      
    3. Verify that the Heat services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack orchestration service list"
      
    4. Verify that the Cinder services are up:

      salt -C 'I@nova:controller and *01*' cmd.run ". /root/keystonercv3; openstack volume service list"
      
    5. Apply the <app>.upgrade.verify state for ctl01*. The output should not include errors.

      . /root/non-clustered-rabbit-helpers.sh
      run_openstack_update_states ctl01* upgrade.verify
      
  6. Perform post-configuration steps:

    1. Disable the RabbitMQUnequalQueueCritical Prometheus alert:

      1. In stacklight/server.yml, add the following variable:

        parameters:
          prometheus:
            server:
              alert:
                RabbitMQUnequalQueueCritical:
                  enabled: false
        
      2. Apply the Prometheus state to the mon nodes:

        salt -C 'I@docker:swarm and I@prometheus:server' state.sls prometheus.server -b1
        
    2. Revert the changes in the Reclass model on the cluster level:

      1. In openstack/control.yml, set allow_automatic_dhcp_failover back to true or leave as is if you did not change the value.
      2. In openstack/control.yml, remove the nova:controller:update_cells: true parameter.
    3. Apply the Neutron state:

      salt -C 'I@neutron:server' state.apply neutron.server
      
    4. Verify the changes:

      salt -C 'I@neutron:server' cmd.run "grep allow_automatic_dhcp_failover /etc/neutron/neutron.conf"
      
    5. Remove the script:

      rm -f /root/non-clustered-rabbit-helpers.sh