Kafka

This section describes the alerts for the Kafka service.


KafkaServiceDown

Severity Minor
Summary The Kafka service on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="kafka-server"} == 0
Description Raises when Kafka on a particular host does not respond to Telegraf, typically indicating that Kafka is unavailable on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Kafka status by running systemctl status kafka on the affected node.
  • Inspect the Kafka logs on the affected node using journalctl -u kafka.
  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
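
The raise condition above is a PromQL filter: the == 0 comparison keeps only the series whose value is 0, so the alert is evaluated once per host that reports a stopped process. A minimal Python sketch of that logic follows. The hosts_down helper, the input format, and the host names are hypothetical and only stand in for the latest procstat_running samples. Under this reading, a host that reports no sample at all is not matched by the condition.

```python
def hosts_down(samples):
    """Return the hosts whose procstat_running sample equals 0.

    `samples` maps a host name to its latest procstat_running value,
    or to None when no sample was reported (hypothetical input format).
    """
    # Like the PromQL `== 0` filter, hosts without a sample are skipped.
    return sorted(host for host, value in samples.items() if value == 0)


# Hypothetical samples: one running, one stopped, one not reporting.
samples = {"kfk01": 1, "kfk02": 0, "kfk03": None}
print(hosts_down(samples))
```

Each returned host corresponds to one KafkaServiceDown alert carrying that name in its host label.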

KafkaServiceDownMinor

Severity Minor
Summary {{ $value }} Kafka services are down (at least {{monitoring.services_failed_warning_threshold_percent*100}}%).
Raise condition count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.3
Description Raises when at least 30% of the Kafka services in the cluster are unavailable.
Troubleshooting
  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.
  • Inspect the Kafka logs on each affected node using journalctl -u kafka.
Tuning Not required
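
Because the raise condition compares a whole number of down services against a fraction of the cluster size, the number of failures needed to trigger it depends on how many Kafka services exist. The sketch below works through that arithmetic. The min_down_to_fire helper and the cluster sizes are illustrative assumptions, not part of the alert definition.

```python
def min_down_to_fire(total, threshold=0.3):
    """Smallest number of down services for which down >= total * threshold."""
    # count(... == 0) returns no result when nothing is down, so start at 1.
    for down in range(1, total + 1):
        if down >= total * threshold:
            return down
    return None  # no value satisfies the condition (for example, total is 0)


# Illustrative cluster sizes only.
for total in (3, 5, 10):
    print(total, min_down_to_fire(total))
```

In a three-node cluster, for example, a single stopped service already satisfies the 0.3 condition, because 1 is greater than 3 * 0.3.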

KafkaServiceDownMajor

Severity Major
Summary {{ $value }} Kafka services are down (at least {{monitoring.services_failed_critical_threshold_percent*100}}%).
Raise condition count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.6
Description Raises when at least 60% of the Kafka services in the cluster are unavailable.
Troubleshooting
  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.
  • Inspect the Kafka logs on each affected node using journalctl -u kafka.
Tuning Not required

KafkaServiceOutage

Severity Critical
Summary All Kafka services are down.
Raise condition count(procstat_running{process_name="kafka-server"} == 0) == count(procstat_running{process_name="kafka-server"})
Description Raises when all Kafka services across the cluster do not respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting If Kafka is up and running, inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.
Tuning Not required
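
The three cluster-wide raise conditions in this section overlap: a full outage also satisfies the 0.3 and 0.6 conditions. The following Python sketch restates the conditions side by side to show which of them hold for a given number of down services. The cluster_alerts helper is hypothetical, and whether lower-severity alerts are suppressed when a higher one fires is not stated in this section, so the sketch simply lists every condition that holds.

```python
MINOR_THRESHOLD = 0.3  # from the KafkaServiceDownMinor raise condition
MAJOR_THRESHOLD = 0.6  # from the KafkaServiceDownMajor raise condition


def cluster_alerts(down, total):
    """Return the cluster-wide alerts whose raise condition holds."""
    if down == 0 or total == 0:
        # count() over an empty result yields nothing, so no alert is raised.
        return []
    alerts = []
    if down >= total * MINOR_THRESHOLD:
        alerts.append("KafkaServiceDownMinor")
    if down >= total * MAJOR_THRESHOLD:
        alerts.append("KafkaServiceDownMajor")
    if down == total:
        alerts.append("KafkaServiceOutage")
    return alerts


# Illustrative three-node cluster with 0 to 3 services down.
for down in range(4):
    print(down, cluster_alerts(down, 3))
```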