Open vSwitch

This section describes the alerts for the Open vSwitch (OVS) processes.

Warning

  • Monitoring of the OVS processes is available starting from the MCP 2019.2.3 update.
  • The OVSInstanceArpingCheckDown alert is available starting from the MCP 2019.2.4 update.
  • The OVSTooManyPortRunningOnAgent, OVSErrorOnPort, OVSNonInternalPortDown and OVSGatherFailed alerts are available starting from the MCP 2019.2.6 update.

ProcessOVSVswitchdMemoryWarning

Available starting from the 2019.2.3 maintenance update

Severity Warning
Summary The ovs-vswitchd process consumes more than 20% of system memory.
Raise condition procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.2
Description Raises when the virtual memory consumed by the ovs-vswitchd process exceeds 20% of the total host memory.
Tuning Not required
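
To cross-check the reported value directly on the affected node, compare the virtual memory size of the ovs-vswitchd process with the total host memory. This is only a rough manual equivalent of the alert expression; it assumes that procstat_memory_vms maps to the VSZ value reported by ps, which is how the Telegraf procstat input typically populates this metric:

  # Virtual memory size (VSZ, in KiB) of ovs-vswitchd versus the host total.
  ps -C ovs-vswitchd -o pid,vsz,rss,%mem,cmd
  grep MemTotal /proc/meminfo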

ProcessOVSVswitchdMemoryCritical

Available starting from the 2019.2.3 maintenance update

Severity Critical
Summary The ovs-vswitchd process consumes more than 30% of system memory.
Raise condition procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.3
Description Raises when the virtual memory consumed by the ovs-vswitchd process exceeds 30% of the total host memory.
Tuning Not required

OVSInstanceArpingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity Major
Summary The OVS instance arping check is down.
Raise condition instance_arping_check_up == 0
Description Raises when the OVS instance arping check on the {{ $labels.host }} node is down for 2 minutes. The host label in the raised alert contains the affected node name.
Tuning Not required
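
If the alert fires, you can manually verify ARP reachability of the affected workloads, for example from a Neutron namespace on the corresponding node. The namespace name, interface, and IP address below are placeholders, and the exact mechanics of the StackLight arping check may differ from this generic sketch:

  # List the available namespaces, then send a few ARP requests from one of them.
  ip netns list
  ip netns exec qdhcp-<network_id> arping -c 3 -I <interface> <instance_ip>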

OVSTooManyPortRunningOnAgent

Available starting from the 2019.2.6 maintenance update

Severity Major
Summary The number of OVS ports is {{ $value }} (ovs-vsctl list port) on the {{ $labels.host }} host, which is more than the expected limit.
Raise condition sum by (host) (ovs_bridge_status) > 1500
Description

Raises when too many networks are created or OVS does not properly clean up the OVS ports. OVS may malfunction if too many ports are assigned to a single agent.

Warning

For production environments, configure the alert after deployment.

Troubleshooting
  • Run ovs-vsctl show from the affected node and openstack port list from the OpenStack controller nodes, then inspect the existing ports (see the example after this list).
  • Remove the unneeded ports or redistribute the OVS ports.
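
For example, to obtain a quick per-bridge port count on the affected node (br-int is used only as a typical bridge name; substitute the bridges listed in the ovs-vsctl show output):

  # Show the OVS configuration, then count the ports attached to a bridge.
  ovs-vsctl show
  ovs-vsctl list-ports br-int | wc -l
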
Tuning

For example, to change the threshold to 1600:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            OVSTooManyPortRunningOnAgent:
              if: >-
                sum by (host) (ovs_bridge_status) > 1600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
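
You can also confirm from the command line that the new threshold has been loaded by querying the Prometheus HTTP API. The command below is a sketch: it assumes Prometheus 2.2 or later, which exposes the /api/v1/rules endpoint, and Python available for pretty-printing; substitute the host and port of your deployment:

  curl -s http://<prometheus_host>:<prometheus_port>/api/v1/rules | \
    python -m json.tool | grep -A 6 '"OVSTooManyPortRunningOnAgent"'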

OVSErrorOnPort

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is reporting errors.
Raise condition ovs_bridge_status == 2
Description Raises when an OVS port reports errors, indicating that the port is not working properly.
Troubleshooting
  1. From the affected node, run ovs-vsctl show.
  2. Inspect the output for error entries, as in the example below.
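
For example, to narrow the output down to the error entries and, optionally, to query the error column of the Interface table directly:

  # Show only the lines around error entries, keeping the owning port as context.
  ovs-vsctl show | grep -i -B 2 error
  # Optionally, list the name and error columns of all interfaces.
  ovs-vsctl --columns=name,error list Interface
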
Tuning Not required

OVSNonInternalPortDown

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is down.
Raise condition ovs_bridge_status{type!="internal"} == 0
Description Raises when the port on the OVS bridge is in the DOWN state, which may lead to an unexpected network disturbance.
Troubleshooting
  1. From the affected node, run ip a to verify whether the port is in the DOWN state.
  2. If required, bring the port up using ifconfig <interface> up, as in the example after this list.
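
A minimal iproute2-only sequence for the same steps is shown below; the interface name is a placeholder, and ip link set can be used instead of ifconfig on hosts where the legacy net-tools package is not installed:

  # Check the operational state of the port, then bring it up.
  ip link show dev <interface>
  ip link set dev <interface> up
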
Tuning Not required

OVSGatherFailed

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary Failure to gather the OVS information on the {{ $labels.host }} host.
Raise condition ovs_bridge_check == 0
Description Raises when the check script for the OVS bridge fails to gather data. In this case, OVS is not monitored.
Troubleshooting Run /usr/local/bin/ovs_parse_bridge.py from the affected host and inspect the output.
Tuning Not required