Keepalived

KeepalivedProcessDown

Severity

Major

Summary

The Keepalived process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="keepalived"} == 0

Description

Raises when Keepalived on a particular host does not respond to Telegraf, typically indicating that Keepalived is down. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the Keepalived status on the affected node using systemctl status keepalived.

  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.

  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
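
The three checks above can be collected in one pass. The following is a minimal sketch rather than a supported tool: the keepalived_triage and run helper names, the DRY_RUN variable, and the default one-hour log window are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper that groups the troubleshooting commands listed above.
# keepalived_triage, run, and DRY_RUN are illustrative names only.

run() {
  # Print the command instead of executing it when DRY_RUN=1.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

keepalived_triage() {
  # Optional first argument: how far back to read the logs (assumed default).
  local since="${1:-1 hour ago}"
  # Service state as reported by systemd.
  run systemctl status keepalived --no-pager
  # Recent Keepalived and Telegraf log entries.
  run journalctl -u keepalived --since "$since" --no-pager
  run journalctl -u telegraf --since "$since" --no-pager
}

# Usage on the affected node:
#   keepalived_triage "2 hours ago"
```

Setting DRY_RUN=1 prints the commands without running them, which helps to review them before use on a production node.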

Tuning

Not required

KeepalivedProcessNotResponsive

Severity

Major

Summary

The Keepalived process on the {{ $labels.host }} node is not responding.

Raise condition

keepalived_up == 0

Description

Raises when Keepalived on a particular host does not respond to Telegraf, typically indicating that Keepalived is running but is not responsive on that node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the Keepalived status on the affected node using systemctl status keepalived.

  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.

  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.

Tuning

Not required

KeepalivedFailedState

Severity

Minor

Summary

The Keepalived VRRP {{ $labels.name }} is in the FAILED state on the {{ $labels.host }} node.

Raise condition

keepalived_state == 0

Description

Raises when the Keepalived Virtual Router Redundancy Protocol (VRRP) is in the FAILED state on a node, typically indicating network issues. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.

  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.

  • Inspect the affected node for any network issues.

Tuning

Not required

KeepalivedUnknownState

Severity

Minor

Summary

The Keepalived VRRP {{ $labels.name }} is in the UNKNOWN state on the {{ $labels.host }} node.

Raise condition

keepalived_state == -1

Description

Raises when the Keepalived Virtual Router Redundancy Protocol (VRRP) is in the UNKNOWN state on a node, typically indicating that Keepalived has improperly reported its state or Telegraf cannot gather the state. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Inspect the Keepalived logs on the affected node using journalctl -u keepalived.

  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.

Tuning

Not required

KeepalivedMultipleIPAddr

Severity

Major

Summary

The Keepalived {{ $labels.ip }} virtual IP is assigned more than once.

Raise condition

count(ipcheck_assigned) by (ip) > 1

Description

Raises when the virtual IP address (VIP) of Keepalived is assigned more than once (on more than one node within a cluster).

Troubleshooting

On each node of the Keepalived cluster (ctl nodes by default), verify whether the VIP is assigned on two or more nodes or interfaces using the ip a | grep VIP_address command.
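
A plain grep can also match longer addresses that merely contain the VIP (for example, 10.0.0.1 inside 10.0.0.10). The sketch below is one way to count exact matches. It assumes the usual "inet ADDRESS/PREFIX" form of the ip -o addr show output, and the count_vip function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many times a VIP appears in the output of
# `ip -o addr show`. The output is read from stdin, so saved output works too.

count_vip() {
  local vip="$1"
  # -F treats the address as a fixed string. The leading space and trailing "/"
  # prevent partial matches such as 10.0.0.1 inside 10.0.0.10 or 110.0.0.1.
  # `|| true` keeps the exit status at 0 when the count is 0.
  grep -cF " ${vip}/" || true
}

# Usage on a node (VIP_address is a placeholder, as in the command above):
#   ip -o addr show | count_vip VIP_address
```

A result above 1 on a single node means that the VIP is bound to several interfaces of that node. A result of 1 on more than one node means that the VIP is assigned on several nodes.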

Tuning

Not required

KeepalivedServiceOutage

Severity

Critical

Summary

All Keepalived processes within the {{ $labels.cluster }} cluster are down.

Raise condition

count(label_replace(procstat_running{process_name="keepalived"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="keepalived"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
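
The expression compares, per cluster, the number of all Keepalived process series with the number of those series that report 0. The two counts are equal only when every process in the cluster is down. The cluster label is derived from the host label by the ([^0-9]+).+ regular expression, which keeps the leading non-digit part of the host name. The sketch below mimics that derivation with sed for host names that match the expression. The cluster_of function name and the sample host names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: mimic label_replace(..., "cluster", "$1", "host", "([^0-9]+).+").
# Prometheus anchors the regex to the whole label value, hence ^ and $ here.
# Covers matching host names only; cluster_of is a hypothetical helper.

cluster_of() {
  printf '%s\n' "$1" | sed -E 's/^([^0-9]+).+$/\1/'
}

cluster_of ctl01                 # prints: ctl
cluster_of ctl02                 # prints: ctl
cluster_of ctl03.example.local   # prints: ctl
```

With these sample names, all three hosts fall into the same ctl cluster, so the alert would raise only when Keepalived is down on all three.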

Description

Raises when none of the Keepalived processes in the cluster respond to Telegraf, typically indicating configuration or deployment issues.

Troubleshooting

  • Inspect the KeepalivedProcessDown alerts for the host names of the affected nodes.

  • Inspect the Keepalived logs on the affected nodes using journalctl -u keepalived.

  • Inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.

Tuning

Not required