Kafka

This section describes the alerts for the Kafka service.


KafkaServiceDown

Severity

Minor

Summary

The Kafka service on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="kafka-server"} == 0

Description

Raises when Kafka on a particular host does not respond to Telegraf, typically indicating that Kafka is unavailable on that node. The host label in the raised alert contains the host name of the affected node.
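As a rough illustration of how this per-host condition behaves, the following Python sketch applies the same comparison to a mapping of host names to procstat_running values. The function name, the dictionary input, and the sample host names are hypothetical and exist only for this example; the monitoring system evaluates the real expression itself.

```python
def hosts_with_kafka_down(samples):
    """Return the hosts whose procstat_running value equals 0.

    `samples` is a hypothetical mapping of host label -> procstat_running
    value for process_name="kafka-server".
    """
    return sorted(host for host, running in samples.items() if running == 0)


# Hypothetical sample data: one of three nodes reports no running process.
samples = {"kfk01": 1, "kfk02": 0, "kfk03": 1}
for host in hosts_with_kafka_down(samples):
    # Mirrors the alert summary, with the host label filled in.
    print(f"The Kafka service on the {host} node is down.")
```

Each matching host produces its own alert, which is why the host label identifies the node to inspect.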

Troubleshooting

  • Verify the Kafka status by running systemctl status kafka on the affected node.

  • Inspect the Kafka logs on the affected node using journalctl -u kafka.

  • Inspect the Telegraf logs on the affected node using journalctl -u telegraf.

Tuning

Not required

KafkaServiceDownMinor

Severity

Minor

Summary

{{ $value }} Kafka services are down (at least {{monitoring.services_failed_warning_threshold_percent*100}}%).

Raise condition

count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.3

Description

Raises when at least 30% of the Kafka services in the cluster are unavailable.
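To make the threshold concrete, the sketch below computes the smallest number of down services that satisfies the raise condition for a given cluster size. The function name and the cluster sizes are assumptions for illustration. It also uses integer arithmetic instead of the floating-point multiplication in the expression, so treat it as an approximation of the rule rather than its exact evaluation.

```python
def min_down_to_raise(total, threshold_percent):
    """Smallest `down` such that down >= total * threshold_percent / 100.

    `total` is the number of Kafka services that report the metric;
    `threshold_percent` is 30 for this alert.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Ceiling division on integers avoids floating-point rounding.
    needed = -(-(total * threshold_percent) // 100)
    # At least one service must be down: with none down, the
    # count(... == 0) side of the expression returns no data.
    return max(1, needed)


# Hypothetical cluster sizes.
for total in (3, 5, 10):
    print(total, min_down_to_raise(total, 30))
```

With these inputs, a 3-node cluster reaches the threshold with 1 service down, a 5-node cluster with 2, and a 10-node cluster with 3.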

Troubleshooting

  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.

  • Inspect the Kafka logs on each affected node using journalctl -u kafka.

Tuning

Not required

KafkaServiceDownMajor

Severity

Major

Summary

{{ $value }} Kafka services are down (at least {{monitoring.services_failed_critical_threshold_percent*100}}%).

Raise condition

count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.6

Description

Raises when at least 60% of the Kafka services in the cluster are unavailable.

Troubleshooting

  • Inspect the KafkaServiceDown alerts for the host names of the affected nodes.

  • Inspect the Kafka logs on each affected node using journalctl -u kafka.

Tuning

Not required

KafkaServiceOutage

Severity

Critical

Summary

All Kafka services are down.

Raise condition

count(procstat_running{process_name="kafka-server"} == 0) == count(procstat_running{process_name="kafka-server"})

Description

Raises when none of the Kafka services in the cluster respond to Telegraf, typically indicating deployment or configuration issues.
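The cluster-level raise conditions in this section build on the same two counts, so several of them can hold at once. The following minimal sketch, assuming a hypothetical helper that receives the number of down services and the total number of services, shows which conditions hold for a given state. It does not model how the alerting system handles overlapping alerts, which this section does not describe.

```python
def cluster_alerts(down, total):
    """Return the cluster-level Kafka alerts whose raise condition holds
    for `down` unavailable services out of `total`."""
    if down == 0 or total == 0:
        # count() over an empty result returns no data, so nothing raises.
        return []
    raised = []
    if down >= total * 0.3:
        raised.append("KafkaServiceDownMinor")
    if down >= total * 0.6:
        raised.append("KafkaServiceDownMajor")
    if down == total:
        raised.append("KafkaServiceOutage")
    return raised


# Hypothetical 5-node cluster.
print(cluster_alerts(2, 5))  # ['KafkaServiceDownMinor']
print(cluster_alerts(4, 5))  # ['KafkaServiceDownMinor', 'KafkaServiceDownMajor']
print(cluster_alerts(5, 5))  # all three alerts
```

In addition to these, the per-host KafkaServiceDown condition holds for every node that is down.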

Troubleshooting

If Kafka is up and running, inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.

Tuning

Not required