ZooKeeper

This section describes the alerts for the ZooKeeper service.


ZookeeperServiceDown

Severity

Minor

Summary

The ZooKeeper service on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

zookeeper_up == 0

Description

Raises when the ZooKeeper service on a host node does not respond to Telegraf, typically indicating that ZooKeeper is down on that node. The host label in the raised alert contains the host name of the affected node.
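
The per-host check can be illustrated with a small Python sketch. It is not how StackLight or Prometheus evaluates the rule, only an illustration of the `zookeeper_up == 0` filter applied to one scrape. The host names and the `hosts_down` helper are hypothetical, and the sketch ignores the 2-minute hold period mentioned in the summary.

```python
def hosts_down(samples: dict[str, float]) -> list[str]:
    """Return the hosts whose zookeeper_up sample equals 0.

    Mirrors the filter `zookeeper_up == 0` for a single point in time;
    each returned name corresponds to the `host` label of a raised alert.
    """
    return sorted(host for host, value in samples.items() if value == 0)


# Hypothetical scrape result: one sample per host.
samples = {"ctl01": 1, "ctl02": 0, "ctl03": 1}
print(hosts_down(samples))  # ['ctl02']
```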

Troubleshooting

  • Verify the ZooKeeper status on the affected node using service zookeeper status.

  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
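
As a complementary check to the steps above, ZooKeeper can be probed directly with its `ruok` four-letter command, to which a healthy server replies `imok`. The following Python sketch is an assumption-laden illustration, not part of the alert definition: it assumes the default client port 2181 and that `ruok` is permitted by `4lw.commands.whitelist` (required in ZooKeeper 3.5 and later). The `zk_ruok` helper name is hypothetical.

```python
import socket


def zk_ruok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Return True if the server answers the 'ruok' command with 'imok'.

    Port 2181 is an assumed default; adjust it to the deployment.
    Any connection error or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        return False
    return b"".join(chunks) == b"imok"
```

For example, `zk_ruok("localhost")` run on the affected node returns `False` when nothing listens on the port or when the command is not whitelisted, so a negative result needs to be read together with the service status and logs.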

Tuning

Not required

ZookeeperServiceErrorWarning

Severity

Warning

Summary

The ZooKeeper service on the {{ $labels.host }} node is not responding for 2 minutes.

Raise condition

zookeeper_service_health == 0

Description

Raises when the ZooKeeper service on a node is not healthy (in operational mode), typically indicating that the service is unresponsive due to high load or to an operating system or hardware issue on the node.

Troubleshooting

  • Inspect dmesg and /var/log/kern.log.

  • Inspect the logs in /var/log/zookeeper.

Tuning

Not required

ZookeeperServicesDownMinor

Severity

Minor

Summary

More than 30% of ZooKeeper services are down for 2 minutes.

Raise condition

count(zookeeper_up == 0) >= count(zookeeper_up) * 0.3

Description

Raises when at least 30% of the ZooKeeper services in a cluster are unavailable. Because the raise condition uses >=, exactly 30% also triggers the alert.
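
To make the threshold concrete, the following Python sketch computes the smallest number of down services that satisfies the raise condition for a given cluster size. It is only a worked illustration of the comparison `down >= total * 0.3`. The `min_down_to_fire` helper and the cluster sizes are hypothetical, and the 2-minute hold period is not modeled.

```python
def min_down_to_fire(total: int, threshold: float = 0.3) -> int:
    """Smallest count of down services for which down >= total * threshold.

    Uses the same float comparison as the raise condition.
    """
    for down in range(1, total + 1):
        if down >= total * threshold:
            return down
    raise ValueError("total must be at least 1")


# Hypothetical ensemble sizes.
for size in (3, 5, 7):
    print(size, min_down_to_fire(size))
# 3 1
# 5 2
# 7 3
```

In a three-node cluster, a single unavailable service (about 33%) is therefore already enough to meet the condition.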

Troubleshooting

Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.

Tuning

Not required

ZookeeperServicesDownMajor

Severity

Major

Summary

More than 60% of ZooKeeper services are down for 2 minutes.

Raise condition

count(zookeeper_up == 0) >= count(zookeeper_up) * 0.6

Description

Raises when at least 60% of the ZooKeeper services in a cluster are unavailable. Because the raise condition uses >=, exactly 60% also triggers the alert.

Troubleshooting

Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.

Tuning

Not required

ZookeeperServiceOutage

Severity

Critical

Summary

All ZooKeeper services are down for 2 minutes.

Raise condition

count(zookeeper_up == 0) == count(zookeeper_up)

Description

Raises when all ZooKeeper services across a cluster do not respond to Telegraf, typically indicating deployment or configuration issues.
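
The three cluster-level raise conditions in this section are nested: whenever all services are down, the 30% and 60% conditions hold as well. The Python sketch below evaluates the three expressions side by side for one set of `zookeeper_up` values. It is an illustration under simplifying assumptions: the `cluster_alerts` helper is hypothetical, the 2-minute hold period is ignored, and any alert inhibition configured in the monitoring stack is not modeled.

```python
def cluster_alerts(up_values: list[int]) -> list[str]:
    """Return the cluster-level alerts whose raise condition holds.

    up_values holds one zookeeper_up sample (1 or 0) per service.
    """
    total = len(up_values)
    down = sum(1 for value in up_values if value == 0)
    if total == 0 or down == 0:
        # count(zookeeper_up == 0) yields no result, so nothing can fire.
        return []
    firing = []
    if down >= total * 0.3:
        firing.append("ZookeeperServicesDownMinor")
    if down >= total * 0.6:
        firing.append("ZookeeperServicesDownMajor")
    if down == total:
        firing.append("ZookeeperServiceOutage")
    return firing


print(cluster_alerts([0, 1, 1]))  # ['ZookeeperServicesDownMinor']
print(cluster_alerts([0, 0, 0]))
# ['ZookeeperServicesDownMinor', 'ZookeeperServicesDownMajor', 'ZookeeperServiceOutage']
```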

Troubleshooting

  • Inspect the ZooKeeper logs on any node of the affected cluster using journalctl -u zookeeper.

  • If ZooKeeper is up and running, inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.

Tuning

Not required