Kafka

This section describes the alerts for the Kafka service.

KafkaServiceDown
KafkaServiceDownMinor
KafkaServiceDownMajor
KafkaServiceOutage

KafkaServiceDown

Severity	Minor
Summary	The Kafka service on the `{{ $labels.host }}` node is down.
Raise condition	`procstat_running{process_name="kafka-server"} == 0`
Description	Raises when Telegraf reports no running `kafka-server` process on a particular host, which typically indicates that Kafka is unavailable on that node. The `host` label of the raised alert contains the host name of the affected node.
Troubleshooting	Verify the Kafka status by running `systemctl status kafka` on the affected node. Inspect the Kafka logs on the affected node using `journalctl -u kafka`. Inspect the Telegraf logs on the affected node using `journalctl -u telegraf`.
Tuning	Not required
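
The raise condition above can be read as a filter over per-host samples. The following Python sketch is illustrative only: the `samples` mapping and its host names are hypothetical stand-ins for the `procstat_running` series, not output from a real deployment.

```python
# Hypothetical samples: host name -> value of
# procstat_running{process_name="kafka-server"} on that host.
samples = {"kfk01": 1, "kfk02": 0, "kfk03": 1}


def hosts_down(samples):
    """Return the hosts for which the raise condition `== 0` holds."""
    return sorted(host for host, running in samples.items() if running == 0)


# One KafkaServiceDown alert is raised per returned host,
# with the host name carried in the `host` label.
print(hosts_down(samples))
```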

KafkaServiceDownMinor

Severity	Minor
Summary	`{{ $value }}` Kafka services are down (at least `{{monitoring.services_failed_warning_threshold_percent*100}}%`).
Raise condition	`count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.3`
Description	Raises when at least 30% of the Kafka services in the cluster are unavailable.
Troubleshooting	Inspect the `KafkaServiceDown` alerts for the host names of the affected nodes. Inspect the Kafka logs on each affected node using `journalctl -u kafka`.
Tuning	Not required
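
The threshold arithmetic in the raise condition can be checked by hand. The sketch below mirrors the comparison for plain integers. The `threshold_reached` helper and the 10-service cluster are assumptions made for illustration and are not part of the alert definition.

```python
def threshold_reached(down, total, factor):
    """Mirror `count(... == 0) >= count(...) * factor` for plain integers."""
    # In PromQL, count() over an empty vector returns no result, so the
    # expression cannot hold when no service is down.
    if down == 0 or total == 0:
        return False
    return down >= total * factor


# Hypothetical cluster of 10 Kafka services and the 0.3 factor:
# 2 services down stays below the threshold, 3 services down reaches it.
for down in (2, 3):
    print(down, threshold_reached(down, 10, 0.3))
```

Because the comparison is `>=`, the alert is raised as soon as the share of unavailable services reaches the factor exactly, not only when it exceeds it.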

KafkaServiceDownMajor

Severity	Major
Summary	`{{ $value }}` Kafka services are down (at least `{{monitoring.services_failed_critical_threshold_percent*100}}%`).
Raise condition	`count(procstat_running{process_name="kafka-server"} == 0) >= count(procstat_running{process_name="kafka-server"}) * 0.6`
Description	Raises when at least 60% of the Kafka services in the cluster are unavailable.
Troubleshooting	Inspect the `KafkaServiceDown` alerts for the host names of the affected nodes. Inspect the Kafka logs on each affected node using `journalctl -u kafka`.
Tuning	Not required

KafkaServiceOutage

Severity	Critical
Summary	All Kafka services are down.
Raise condition	`count(procstat_running{process_name="kafka-server"} == 0) == count(procstat_running{process_name="kafka-server"})`
Description	Raises when none of the Kafka services in the cluster respond to Telegraf, typically indicating deployment or configuration issues.
Troubleshooting	If Kafka is up and running, inspect the Telegraf logs on the affected nodes using `journalctl -u telegraf`.
Tuning	Not required
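
The raise conditions of the cluster-level alerts in this section are not mutually exclusive: an outage also satisfies the 0.3 and 0.6 comparisons. The following sketch, assuming a hypothetical 5-node cluster and an illustrative `cluster_alerts` helper, lists which conditions hold for a given number of unavailable services. How the alerting pipeline groups or inhibits overlapping alerts is outside its scope.

```python
# Factors taken from the raise conditions in this section.
THRESHOLDS = [
    ("KafkaServiceDownMinor", 0.3),
    ("KafkaServiceDownMajor", 0.6),
]


def cluster_alerts(down, total):
    """List the cluster-level alerts whose raise condition holds."""
    # count() over an empty vector returns no result in PromQL,
    # so no condition holds when nothing is down.
    if down == 0 or total == 0:
        return []
    alerts = [name for name, factor in THRESHOLDS if down >= total * factor]
    if down == total:
        alerts.append("KafkaServiceOutage")
    return alerts


# Hypothetical 5-node cluster: walk from one to five services down.
for down in range(1, 6):
    print(down, cluster_alerts(down, 5))
```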

updated: 2025-01-10 08:56
