Open vSwitch

This section describes the alerts for the Open vSwitch (OVS) processes.

Warning

  • Monitoring of the OVS processes is available starting from the MCP 2019.2.3 update.

  • The OVSInstanceArpingCheckDown alert is available starting from the MCP 2019.2.4 update.

  • The OVSTooManyPortRunningOnAgent, OVSErrorOnPort, OVSNonInternalPortDown, and OVSGatherFailed alerts are available starting from the MCP 2019.2.6 update.


ProcessOVSVswitchdMemoryWarning

Available starting from the 2019.2.3 maintenance update

Severity

Warning

Summary

The ovs-vswitchd process consumes more than 20% of system memory.

Raise condition

procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.2

Description

Raises when the virtual memory consumed by the ovs-vswitchd process exceeds 20% of the total host memory.
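The raise condition divides the process virtual memory by the host total and compares the ratio against a threshold. A minimal Python sketch of the same arithmetic (the sample values are purely illustrative):

```python
def ovs_vswitchd_memory_alert(vms_bytes, mem_total_bytes, threshold=0.2):
    """Return True if the process virtual memory exceeds the given
    fraction of host memory (0.2 for the Warning alert, 0.3 for the
    Critical one), mirroring the PromQL raise condition."""
    return vms_bytes / mem_total_bytes > threshold

# Illustrative sample: 14 GiB of virtual memory on a 64 GiB host
vms = 14 * 1024**3
total = 64 * 1024**3
print(ovs_vswitchd_memory_alert(vms, total))       # 0.21875 > 0.2, Warning fires
print(ovs_vswitchd_memory_alert(vms, total, 0.3))  # Critical does not
```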

Tuning

Not required

ProcessOVSVswitchdMemoryCritical

Available starting from the 2019.2.3 maintenance update

Severity

Critical

Summary

The ovs-vswitchd process consumes more than 30% of system memory.

Raise condition

procstat_memory_vms{process_name="ovs-vswitchd"} / on(host) mem_total > 0.3

Description

Raises when the virtual memory consumed by the ovs-vswitchd process exceeds 30% of the total host memory.

Tuning

Not required

OVSInstanceArpingCheckDown

Available starting from the 2019.2.4 maintenance update

Severity

Major

Summary

The OVS instance arping check is down.

Raise condition

instance_arping_check_up == 0

Description

Raises when the OVS instance arping check on the {{ $labels.host }} node is down for 2 minutes. The host label in the raised alert contains the affected node name.

Tuning

Not required

OVSTooManyPortRunningOnAgent

Available starting from the 2019.2.6 maintenance update

Severity

Major

Summary

The number of OVS ports is {{ $value }} (ovs-vsctl list port) on the {{ $labels.host }} host, which is more than the expected limit.

Raise condition

sum by (host) (ovs_bridge_status) > 1500

Description

Raises when too many networks are created or OVS does not properly clean up the OVS ports. OVS may malfunction if too many ports are assigned to a single agent.
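The raise condition sums the ovs_bridge_status series per host; since each healthy port series reports 1 (an assumption inferred from the raise conditions in this section), the sum approximates the per-host port count. A hedged Python sketch of that aggregation, with hypothetical sample data:

```python
from collections import defaultdict

PORT_LIMIT = 1500  # default threshold from the alert definition

def hosts_over_limit(samples, limit=PORT_LIMIT):
    """samples: iterable of (host, value) pairs, one per
    ovs_bridge_status series. Mirrors the PromQL expression
    `sum by (host) (ovs_bridge_status) > limit`."""
    totals = defaultdict(float)
    for host, value in samples:
        totals[host] += value
    return {host: total for host, total in totals.items() if total > limit}

# Hypothetical samples: cmp01 carries 1501 ports, cmp02 carries 10
samples = [("cmp01", 1)] * 1501 + [("cmp02", 1)] * 10
print(hosts_over_limit(samples))  # only cmp01 exceeds the limit
```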

Warning

For production environments, configure the alert after deployment.

Troubleshooting

  • Run ovs-vsctl show from the affected node and openstack port list from the OpenStack controller nodes and inspect the existing ports.

  • Remove the unneeded ports or redistribute the OVS ports.

Tuning

For example, to change the threshold to 1600:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            OVSTooManyPortRunningOnAgent:
              if: >-
                sum by (host) (ovs_bridge_status) > 1600
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
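As an alternative to the web UI in step 4, the updated expression can be checked against the standard Prometheus /api/v1/rules endpoint. A sketch (the server URL is a placeholder; only the JSON traversal is exercised here):

```python
def find_alert_expr(rules_json, alert_name):
    """Search a parsed /api/v1/rules payload for a named alerting
    rule and return its expression, or None if it is absent."""
    for group in rules_json["data"]["groups"]:
        for rule in group.get("rules", []):
            if rule.get("name") == alert_name:
                return rule.get("query")
    return None

# Usage against a live server (URL is a placeholder):
# import json, urllib.request
# rules = json.load(urllib.request.urlopen("http://prometheus:9090/api/v1/rules"))
# print(find_alert_expr(rules, "OVSTooManyPortRunningOnAgent"))
```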

OVSErrorOnPort

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is reporting errors.

Raise condition

ovs_bridge_status == 2

Description

Raises when an OVS port reports errors, indicating that the port is not working properly.

Troubleshooting

  1. From the affected node, run ovs-vsctl show.

  2. Inspect the output for error entries.

Tuning

Not required

OVSNonInternalPortDown

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The {{ $labels.port }} OVS port on the {{ $labels.bridge }} bridge running on the {{ $labels.host }} host is down.

Raise condition

ovs_bridge_status{type!="internal"} == 0

Description

Raises when the port on the OVS bridge is in the DOWN state, which may lead to an unexpected network disturbance.
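Taken together, the raise conditions for OVSErrorOnPort and OVSNonInternalPortDown imply a small status-code scheme for ovs_bridge_status: 0 means down and 2 means errors; treating 1 as the healthy state is an assumption. A Python sketch of which alert a given sample would trigger:

```python
# Status codes inferred from the raise conditions in this section;
# 1 = "up" is an assumption, the other two follow from the alerts.
OVS_PORT_STATUS = {0: "down", 1: "up", 2: "error"}

def port_alert(status, port_type="internal"):
    """Return the alert name an ovs_bridge_status sample would trigger,
    or None. Mirrors OVSErrorOnPort and OVSNonInternalPortDown."""
    if status == 2:
        return "OVSErrorOnPort"
    if status == 0 and port_type != "internal":
        return "OVSNonInternalPortDown"
    return None

print(port_alert(2))            # errors fire on any port type
print(port_alert(0, "vxlan"))   # down fires only on non-internal ports
print(port_alert(0))            # internal ports in DOWN state do not alert
```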

Troubleshooting

  1. From the affected node, run ip a to verify if the port is in the DOWN state.

  2. If required, bring the port up using ifconfig <interface> up.

Tuning

Not required

OVSGatherFailed

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

Failure to gather the OVS information on the {{ $labels.host }} host.

Raise condition

ovs_bridge_check == 0

Description

Raises when the check script for the OVS bridge fails to gather data, meaning that OVS is not being monitored.

Troubleshooting

Run /usr/local/bin/ovs_parse_bridge.py from the affected host and inspect the output.

Tuning

Not required