Insufficient OVS timeouts causing instance traffic losses

If the neutron-openvswitch-agent logs contain OVS timeout errors such as ofctl request <...> timed out: Timeout: 10 seconds or Commands [<ovsdbap...>] exceeded timeout 10 seconds, configure the OVS timeout parameters according to the number of OVS ports on the gateway (gtw) nodes in your cloud. For example, if a gtw node has more than 1000 ports, Mirantis recommends increasing the OVS timeouts as described below. The same procedure applies to the compute nodes if required.
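To estimate whether a node is affected, you can count the ports on the OVS integration bridge. A minimal check, assuming ovs-vsctl is available on the node and br-int is the integration bridge name:

```shell
# Count ports on the OVS integration bridge (br-int is the usual name).
# Run on a gtw or cmp node; requires ovs-vsctl in PATH.
port_count=$(ovs-vsctl list-ports br-int | wc -l)
echo "br-int ports: ${port_count}"
# If the count exceeds ~1000, consider increasing the OVS timeouts.
```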

Warning

This feature is available starting from the MCP 2019.2.3 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

To increase OVS timeouts on the gateway nodes:

  1. Log in to the Salt Master node.

  2. Open /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/gateway.yml for editing.

  3. Add the following snippet to the parameters section of the file, adjusting the values as required.

    neutron:
      gateway:
        of_connect_timeout: 60
        of_request_timeout: 30
        ovs_vsctl_timeout: 30  # Pike
        ovsdb_timeout: 30  # Queens and beyond
    
  4. Apply the following state:

    salt -C 'I@neutron:gateway' state.sls neutron
    
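Once the state is applied, you can confirm that the new values reached the agent configuration. A quick check, assuming the OVS agent configuration lives at /etc/neutron/plugins/ml2/openvswitch_agent.ini (the path and file name may differ between releases):

```shell
# On a gtw node, show the rendered timeout options.
# The configuration file path is an assumption; adjust for your release.
grep -E 'of_connect_timeout|of_request_timeout|ovs_vsctl_timeout|ovsdb_timeout' \
    /etc/neutron/plugins/ml2/openvswitch_agent.ini
```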
  5. Verify whether the Open vSwitch logs contain the Datapath Invalid and no response to inactivity probe errors:

    • In the neutron-openvswitch-agent logs, for example:

      ERROR ... ofctl request <...> error Datapath Invalid 64183592930369: \
      InvalidDatapath: Datapath Invalid 64183592930369
      
    • In openvswitch/ovs-vswitchd.log, for example:

      ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe \
      after 5 seconds, disconnecting
      

    If the logs contain such errors, increase the inactivity probe intervals for the OVS bridge controllers and the OVS manager:

    1. Log in to any gtw node.

    2. Run the following commands:

      ovs-vsctl set controller br-int inactivity_probe=60000
      ovs-vsctl set controller br-tun inactivity_probe=60000
      ovs-vsctl set controller br-floating inactivity_probe=60000
      
    3. Identify the OVS manager ID:

      ovs-vsctl list manager
      
    4. Run the following command:

      ovs-vsctl set manager <ovs_manager_id> inactivity_probe=30000
      
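To confirm that the new probe values took effect, you can inspect the Controller records using standard ovs-vsctl options:

```shell
# Show each bridge controller together with its inactivity probe (ms).
ovs-vsctl --columns=target,inactivity_probe list controller
```

Note that the neutron-openvswitch-agent may recreate the bridge controllers on restart, in which case the values need to be set again.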

To increase OVS timeouts on the compute nodes:

  1. Log in to the Salt Master node.

  2. Open /srv/salt/reclass/classes/cluster/<cluster_name>/openstack/compute.yml for editing.

  3. Add the following snippet to the parameters section of the file, adjusting the values as required.

    neutron:
      compute:
        of_connect_timeout: 60
        of_request_timeout: 30
        ovs_vsctl_timeout: 30  # Pike
        ovsdb_timeout: 30  # Queens and beyond
    
  4. Apply the following state:

    salt -C 'I@neutron:compute' state.sls neutron
    
  5. Verify whether the Open vSwitch logs contain the Datapath Invalid and no response to inactivity probe errors:

    • In the neutron-openvswitch-agent logs, for example:

      ERROR ... ofctl request <...> error Datapath Invalid 64183592930369: \
      InvalidDatapath: Datapath Invalid 64183592930369
      
    • In openvswitch/ovs-vswitchd.log, for example:

      ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe \
      after 5 seconds, disconnecting
      

    If the logs contain such errors, increase the inactivity probe intervals for the OVS bridge controllers and the OVS manager:

    1. Log in to the target cmp node.

    2. Run the following commands:

      ovs-vsctl set controller br-int inactivity_probe=60000
      ovs-vsctl set controller br-tun inactivity_probe=60000
      ovs-vsctl set controller br-floating inactivity_probe=60000
      
    3. Identify the OVS manager ID:

      ovs-vsctl list manager
      
    4. Run the following command:

      ovs-vsctl set manager <ovs_manager_id> inactivity_probe=30000
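Similarly, you can verify the manager setting with standard ovs-vsctl options:

```shell
# Show the OVSDB manager target and its inactivity probe interval (ms).
ovs-vsctl --columns=target,inactivity_probe list manager
```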