ZooKeeper

This section describes the alerts for the ZooKeeper service.


ZookeeperServiceDown

Severity Minor
Summary The ZooKeeper service on the {{ $labels.host }} node is down for 2 minutes.
Raise condition zookeeper_up == 0
Description Raises when the ZooKeeper service on a host node does not respond to Telegraf, typically indicating that ZooKeeper is down on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the ZooKeeper status on the affected node using service zookeeper status.
  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
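
The two troubleshooting steps above can be sketched as a small shell helper. This is only an illustration: the function name check_zookeeper_node, the 15-minute log window, and the assumption that service zookeeper status exits with 0 when the service is running are not taken from this document and may need adjusting for a given node.

```shell
#!/bin/sh
# Hypothetical helper that follows the ZookeeperServiceDown troubleshooting steps.
check_zookeeper_node() {
    # Step 1: check the ZooKeeper status on the node.
    # Assumption: exit code 0 from "service ... status" means "running".
    if service zookeeper status >/dev/null 2>&1; then
        echo "zookeeper is running; showing recent telegraf logs"
        # Step 2: ZooKeeper is up, so inspect the Telegraf logs instead.
        # The 15-minute window is an arbitrary choice for this sketch.
        journalctl -u telegraf --since "15 minutes ago" --no-pager
    else
        echo "zookeeper is not running on this node"
        return 1
    fi
}
```

Run check_zookeeper_node on the node named in the host label of the alert. A non-zero return value means the first step already found ZooKeeper down, so the Telegraf logs are not consulted.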

ZookeeperServiceErrorWarning

Severity Warning
Summary The ZooKeeper service on the {{ $labels.host }} node is not responding for 2 minutes.
Raise condition zookeeper_service_health == 0
Description Raises when the ZooKeeper service on a node is not healthy (that is, not in operational mode), typically indicating that the service is unresponsive because of high load or an operating system or hardware issue on the node.
Troubleshooting
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/zookeeper.
Tuning Not required
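
As a rough sketch of the log inspection above, the helper below prints the tail of every file in a log directory. The function name tail_logs and the default of 50 lines are assumptions made for this example; only the /var/log/zookeeper path comes from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines of each regular file in a directory.
# Usage: tail_logs [DIR] [LINES]
tail_logs() {
    tl_dir=${1:-/var/log/zookeeper}   # default path from the troubleshooting step
    tl_lines=${2:-50}                 # arbitrary default for this sketch
    for tl_file in "$tl_dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$tl_file" ] || continue
        printf '==> %s <==\n' "$tl_file"
        tail -n "$tl_lines" "$tl_file"
    done
}
```

A typical session on the affected node might start with dmesg | tail -n 50 and tail -n 50 /var/log/kern.log for the kernel side, followed by tail_logs for the ZooKeeper side. Reading these files may require root privileges.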

ZookeeperServicesDownMinor

Severity Minor
Summary More than 30% of ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) >= count(zookeeper_up) * 0.3
Description Raises when 30% or more of the ZooKeeper services in a cluster are unavailable.
Troubleshooting Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
Tuning Not required

ZookeeperServicesDownMajor

Severity Major
Summary More than 60% of ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) >= count(zookeeper_up) * 0.6
Description Raises when 60% or more of the ZooKeeper services in a cluster are unavailable.
Troubleshooting Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
Tuning Not required

ZookeeperServiceOutage

Severity Critical
Summary All ZooKeeper services are down for 2 minutes.
Raise condition count(zookeeper_up == 0) == count(zookeeper_up)
Description Raises when none of the ZooKeeper services across a cluster respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting
  • Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.
  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.
Tuning Not required
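
To see how the raise conditions of the three cluster-level alerts (ZookeeperServicesDownMinor, ZookeeperServicesDownMajor, and ZookeeperServiceOutage) map to concrete cluster sizes, the arithmetic can be sketched in shell. The function name zk_down_severity is hypothetical. The sketch reports only the highest matching severity and ignores the 2-minute duration; the lower-severity conditions also hold whenever a higher one does, and whether those alerts are suppressed depends on configuration not described here.

```shell
#!/bin/sh
# Hypothetical helper: zk_down_severity DOWN TOTAL
# Prints the highest severity whose raise condition holds for DOWN of TOTAL services.
zk_down_severity() {
    down=$1
    total=$2
    # count(zookeeper_up == 0) returns no result when nothing is down,
    # so none of the three conditions can match in that case.
    if [ "$down" -le 0 ] || [ "$total" -le 0 ]; then
        echo "none"
    elif [ "$down" -eq "$total" ]; then
        echo "Critical"   # ZookeeperServiceOutage: down == total
    elif [ $((down * 10)) -ge $((total * 6)) ]; then
        echo "Major"      # ZookeeperServicesDownMajor: down >= total * 0.6
    elif [ $((down * 10)) -ge $((total * 3)) ]; then
        echo "Minor"      # ZookeeperServicesDownMinor: down >= total * 0.3
    else
        echo "none"
    fi
}
```

For a 3-node cluster, zk_down_severity 1 3 prints Minor, zk_down_severity 2 3 prints Major, and zk_down_severity 3 3 prints Critical. For a 5-node cluster, one service down (20%) matches nothing, two down (40%) prints Minor, and three down (exactly 60%) prints Major because the conditions use >=.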