Kafka
This section describes the alerts for the Kafka service.
KafkaServiceDown
Severity |
Minor |
Summary |
The Kafka service on the {{ $labels.host }} node is down. |
Raise condition |
procstat_running{process_name="kafka-server"} == 0 |
Description |
Raises when Telegraf reports no running kafka-server process on a
particular host, typically indicating that Kafka is unavailable on that
node. The host label in the raised alert contains the host name of the
affected node. |
Troubleshooting |
- Verify the Kafka status by running systemctl status kafka on the
  affected node.
- Inspect the Kafka logs on the affected node using journalctl -u kafka.
- Inspect the Telegraf logs on the affected node using
  journalctl -u telegraf.
|
Tuning |
Not required |
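The host label mentioned in the description can also be read directly from the metric. The following is a minimal sketch, assuming the result of the query procstat_running{process_name="kafka-server"} was fetched through the Prometheus HTTP API and decoded from JSON. The function name down_hosts, the host names, and the sample payload are hypothetical and serve only as an illustration.

```python
from typing import Any, Dict, List


def down_hosts(response: Dict[str, Any]) -> List[str]:
    """Return the sorted host labels of all series whose sample value is 0."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    hosts = set()
    for series in response["data"]["result"]:
        # An instant-vector sample is a [timestamp, "value"] pair;
        # the value is encoded as a string.
        _timestamp, value = series["value"]
        if float(value) == 0:
            hosts.add(series["metric"].get("host", "<unknown>"))
    return sorted(hosts)


# Hypothetical payload in the shape of an instant-query response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"host": "kfk01", "process_name": "kafka-server"},
             "value": [1700000000, "1"]},
            {"metric": {"host": "kfk02", "process_name": "kafka-server"},
             "value": [1700000000, "0"]},
        ],
    },
}

print(down_hosts(sample))  # ['kfk02']
```

Each host returned this way corresponds to one KafkaServiceDown alert.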
KafkaServiceDownMinor
Severity |
Minor |
Summary |
{{ $value }} Kafka services are down (at least
{{monitoring.services_failed_warning_threshold_percent*100}}%). |
Raise condition |
count(procstat_running{process_name="kafka-server"} == 0) >=
count(procstat_running{process_name="kafka-server"}) * 0.3 |
Description |
Raises when at least 30% of the Kafka services in the cluster are unavailable. |
Troubleshooting |
- Inspect the KafkaServiceDown alerts for the host names of the
  affected nodes.
- Inspect the Kafka logs on each affected node using journalctl -u kafka.
|
Tuning |
Not required |
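Because the raise condition compares a count against a fraction of the cluster size, the number of down services that triggers the alert depends on how many Kafka services are monitored. The following is a minimal sketch of that arithmetic, assuming the 0.3 factor from the raise condition. The function name min_down_to_fire is hypothetical. The sketch also assumes that at least one service must be down, since count() over an empty result returns no data and the expression then cannot fire.

```python
def min_down_to_fire(total: int, fraction: float = 0.3) -> int:
    """Smallest number of down services d (d >= 1) with d >= total * fraction."""
    if total < 1:
        raise ValueError("total must be at least 1")
    for down in range(1, total + 1):
        if down >= total * fraction:
            return down
    return total  # not reached for fraction <= 1


for size in (3, 5, 7):
    print(size, min_down_to_fire(size))
# 3 1
# 5 2
# 7 3
```

In a three-node cluster, a single unavailable Kafka service is therefore already enough to meet the 30% threshold.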
KafkaServiceDownMajor
Severity |
Major |
Summary |
{{ $value }} Kafka services are down (at least
{{monitoring.services_failed_critical_threshold_percent*100}}%). |
Raise condition |
count(procstat_running{process_name="kafka-server"} == 0) >=
count(procstat_running{process_name="kafka-server"}) * 0.6 |
Description |
Raises when at least 60% of the Kafka services in the cluster are unavailable. |
Troubleshooting |
- Inspect the KafkaServiceDown alerts for the host names of the
  affected nodes.
- Inspect the Kafka logs on each affected node using journalctl -u kafka.
|
Tuning |
Not required |
KafkaServiceOutage
Severity |
Critical |
Summary |
All Kafka services are down. |
Raise condition |
count(procstat_running{process_name="kafka-server"} == 0) ==
count(procstat_running{process_name="kafka-server"}) |
Description |
Raises when Telegraf reports no running kafka-server process on any node
of the cluster, typically indicating deployment or configuration issues. |
Troubleshooting |
If Kafka is up and running, inspect the Telegraf logs on the affected
nodes using journalctl -u telegraf. |
Tuning |
Not required |
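The three cluster-level raise conditions are not mutually exclusive, so several of them can hold at the same time. The following is a minimal sketch that evaluates them for a given number of down services, assuming the 0.3 and 0.6 factors shown in the raise conditions and no inhibition rules between the alerts. The function name firing_cluster_alerts is hypothetical.

```python
from typing import List


def firing_cluster_alerts(down: int, total: int) -> List[str]:
    """List the cluster-level Kafka alerts whose raise condition holds."""
    alerts: List[str] = []
    # count(... == 0) returns no data when nothing is down,
    # so none of the expressions can fire in that case.
    if down < 1 or total < 1:
        return alerts
    if down >= total * 0.3:
        alerts.append("KafkaServiceDownMinor")
    if down >= total * 0.6:
        alerts.append("KafkaServiceDownMajor")
    if down == total:
        alerts.append("KafkaServiceOutage")
    return alerts


print(firing_cluster_alerts(1, 3))  # ['KafkaServiceDownMinor']
print(firing_cluster_alerts(3, 3))
# ['KafkaServiceDownMinor', 'KafkaServiceDownMajor', 'KafkaServiceOutage']
```

When more than one alert is returned, the one with the highest severity describes the actual state of the cluster.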