ZooKeeper
This section describes the alerts for the ZooKeeper service.
ZookeeperServiceDown
Severity |
Minor |
Summary |
The ZooKeeper service on the {{ $labels.host }} node is down for 2
minutes. |
Raise condition |
zookeeper_up == 0 |
Description |
Raises when the ZooKeeper service on a host node does not respond to
Telegraf, typically indicating that ZooKeeper is down on that node. The
host label in the raised alert contains the host name of the
affected node. |
Troubleshooting |
- Verify the ZooKeeper status on the affected node:
service zookeeper status
- If ZooKeeper is up and running, inspect the Telegraf logs on the
affected node:
journalctl -u telegraf
|
Tuning |
Not required |
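The summary mentions 2 minutes, which suggests that zookeeper_up has to stay at 0 for that long before the alert is raised. The rule's duration clause is not shown in this section, so the 120-second value and the is_raised helper below are assumptions used only to illustrate the idea.

```python
from typing import Iterable, Optional, Tuple

# Assumed from the "for 2 minutes" wording in the summary; not taken
# from the actual alert rule definition.
HOLD_SECONDS = 120


def is_raised(samples: Iterable[Tuple[float, float]],
              hold: float = HOLD_SECONDS) -> bool:
    """Return True if zookeeper_up stayed at 0 for at least `hold` seconds.

    `samples` is a time-ordered sequence of (timestamp_seconds, value)
    pairs for one host, ending at the most recent scrape.
    """
    down_since: Optional[float] = None
    latest: Optional[float] = None
    for timestamp, value in samples:
        latest = timestamp
        if value == 0:
            if down_since is None:
                down_since = timestamp  # start of the current down streak
        else:
            down_since = None  # any healthy sample resets the streak
    if down_since is None or latest is None:
        return False
    return latest - down_since >= hold
```

For example, three consecutive zero samples taken 60 seconds apart satisfy the hold period, while a single healthy sample in between resets it.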
ZookeeperServiceErrorWarning
Severity |
Warning |
Summary |
The ZooKeeper service on the {{ $labels.host }} node is not
responding for 2 minutes. |
Raise condition |
zookeeper_service_health == 0 |
Description |
Raises when the ZooKeeper service on a node is not healthy (in
operational mode), typically indicating that the service is unresponsive
due to high load or an operating system or hardware issue on the node. |
Troubleshooting |
- Inspect the dmesg output and /var/log/kern.log.
- Inspect the logs in /var/log/zookeeper.
|
Tuning |
Not required |
ZookeeperServicesDownMinor
Severity |
Minor |
Summary |
More than 30% of ZooKeeper services are down for 2 minutes. |
Raise condition |
count(zookeeper_up == 0) >= count(zookeeper_up) * 0.3 |
Description |
Raises when at least 30% of the ZooKeeper services in a cluster are
unavailable (the raise condition uses >=). |
Troubleshooting |
Inspect the ZooKeeper logs on any node of the affected cluster:
journalctl -u zookeeper |
Tuning |
Not required |
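To make the percentage threshold concrete, the following sketch computes the smallest number of down services that satisfies a condition of the form count(zookeeper_up == 0) >= count(zookeeper_up) * fraction. The function name min_down_to_raise is hypothetical and exists only for illustration. The search starts at 1 because, in PromQL, a count over an empty selection returns no result, so the condition cannot match when nothing is down.

```python
def min_down_to_raise(total: int, fraction: float) -> int:
    """Smallest number of down services for which
    down >= total * fraction holds, for a cluster of `total` services.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Start at 1: with zero down services the left-hand count is empty
    # in PromQL, so the comparison yields nothing.
    for down in range(1, total + 1):
        if down >= total * fraction:
            return down
    return total  # not reached when fraction <= 1


if __name__ == "__main__":
    for size in (3, 5, 7):
        print(size, min_down_to_raise(size, 0.3))
```

With a fraction of 0.3, a single down service already satisfies the condition in a 3-node cluster, while a 5-node cluster needs two.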
ZookeeperServicesDownMajor
Severity |
Major |
Summary |
More than 60% of ZooKeeper services are down for 2 minutes. |
Raise condition |
count(zookeeper_up == 0) >= count(zookeeper_up) * 0.6 |
Description |
Raises when at least 60% of the ZooKeeper services in a cluster are
unavailable (the raise condition uses >=). |
Troubleshooting |
Inspect the ZooKeeper logs on any node of the affected cluster:
journalctl -u zookeeper |
Tuning |
Not required |
ZookeeperServiceOutage
Severity |
Critical |
Summary |
All ZooKeeper services are down for 2 minutes. |
Raise condition |
count(zookeeper_up == 0) == count(zookeeper_up) |
Description |
Raises when none of the ZooKeeper services in a cluster respond to
Telegraf, typically indicating deployment or configuration issues. |
Troubleshooting |
- Inspect the ZooKeeper logs on any node of the affected cluster:
journalctl -u zookeeper
- If ZooKeeper is up and running, inspect the Telegraf logs on the
affected nodes:
journalctl -u telegraf
|
Tuning |
Not required |
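The three cluster-level raise conditions are not mutually exclusive: when every service is down, the Minor and Major expressions hold as well. The sketch below evaluates all three against a list of zookeeper_up values, one per service. The function name matching_cluster_alerts is hypothetical, and the sketch ignores any hold period. Whether lower-severity alerts are suppressed once a higher one is raised is not described in this section.

```python
from typing import List, Sequence


def matching_cluster_alerts(up_values: Sequence[int]) -> List[str]:
    """Return the cluster-level alerts whose raise condition holds.

    `up_values` holds one zookeeper_up sample (1 = up, 0 = down)
    per ZooKeeper service in the cluster.
    """
    total = len(up_values)
    down = sum(1 for value in up_values if value == 0)
    if down == 0:
        # count(zookeeper_up == 0) is empty in PromQL, so nothing matches.
        return []
    matched = []
    if down >= total * 0.3:
        matched.append("ZookeeperServicesDownMinor")
    if down >= total * 0.6:
        matched.append("ZookeeperServicesDownMajor")
    if down == total:
        matched.append("ZookeeperServiceOutage")
    return matched


if __name__ == "__main__":
    print(matching_cluster_alerts([0, 1, 1]))
    print(matching_cluster_alerts([0, 0, 1]))
    print(matching_cluster_alerts([0, 0, 0]))
```

In a 3-node cluster, one down service matches only the Minor condition, two match Minor and Major, and three match all of them.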