Kubernetes

This section describes the alerts for Kubernetes.


ContainerScrapeError

Severity

Warning

Summary

Prometheus was not able to scrape metrics from the container on the {{ $labels.instance }} Kubernetes instance.

Raise condition

container_scrape_error != 0

Description

Raises when cAdvisor fails to scrape metrics from a container.

Tuning

Not required

KubernetesProcessDown

Severity

Minor

Summary

The Kubernetes {{ $labels.process_name }} process on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

procstat_running{process_name=~"hyperkube-.*"} == 0

Description

Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on any ctl host, or a running hyperkube-kubelet or hyperkube-proxy process on any cmp host. The process_name label in the raised alert contains the process name.
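The raise condition can be read as a per-host filter over procstat_running samples. The following is a minimal sketch of that filter, not the way Prometheus evaluates the rule. The host names, the sample values, and the down_processes helper are hypothetical.

```python
from typing import Dict, List, Tuple

# (host, process_name) identifies one procstat_running series.
Sample = Tuple[str, str]


def down_processes(samples: Dict[Sample, int]) -> List[Sample]:
    """Return the series matching
    procstat_running{process_name=~"hyperkube-.*"} == 0."""
    return sorted(
        key
        for key, running in samples.items()
        # Prometheus regex matchers are anchored, so a prefix test is equivalent.
        if key[1].startswith("hyperkube-") and running == 0
    )


# Invented snapshot for illustration only.
samples = {
    ("ctl01", "hyperkube-apiserver"): 1,
    ("ctl01", "hyperkube-kubelet"): 0,
    ("cmp01", "hyperkube-proxy"): 1,
    ("cmp02", "hyperkube-kubelet"): 0,
    ("cmp02", "sshd"): 0,  # ignored: does not match hyperkube-.*
}

for host, process_name in down_processes(samples):
    print(f"{process_name} is down on {host}")
```

Each matching series corresponds to one KubernetesProcessDown alert, with the host and process_name labels identifying the affected node and process.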

Troubleshooting

  • Verify the containerd status on the affected node using systemctl status containerd.

  • Verify the Docker status on the affected node using systemctl status docker.

  • For issues with cmp, verify criproxy using systemctl status criproxy.

  • Inspect the logs in /var/log/kubernetes.log.
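The checks above could be collected into a small helper. This is only a sketch: the troubleshooting_commands function, the role argument, and the choice of tail with 100 lines are assumptions. Only the unit names and the log path come from this section.

```python
from typing import List


def troubleshooting_commands(role: str) -> List[List[str]]:
    """Build the status checks for a node; role is "ctl" or "cmp"."""
    commands = [
        ["systemctl", "status", "containerd"],
        ["systemctl", "status", "docker"],
    ]
    if role == "cmp":
        # criproxy is only checked on cmp nodes.
        commands.append(["systemctl", "status", "criproxy"])
    # Assumption: the last 100 log lines are enough for a first look.
    commands.append(["tail", "-n", "100", "/var/log/kubernetes.log"])
    return commands


for argv in troubleshooting_commands("cmp"):
    print(" ".join(argv))
```

Run the printed commands on the affected node, or pass each argument list to subprocess.run if you prefer to automate the checks.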

Tuning

Not required

KubernetesProcessDownMinor

Severity

Minor

Summary

{{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_minor_threshold_percent * 100}}%) are down for 2 minutes.

Raise condition

count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_minor_threshold_percent }}

Description

Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on more than 30% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
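To illustrate the comparison in the raise condition, here is a minimal sketch. It assumes that instance_minor_threshold_percent is 0.3, as the 30% in the description suggests. The threshold_exceeded helper and the host counts are hypothetical.

```python
def threshold_exceeded(down: int, total: int, threshold: float) -> bool:
    """Mirror the raise condition: count(down) > count(total) * threshold."""
    return down > total * threshold


# Assumed value of instance_minor_threshold_percent.
MINOR_THRESHOLD = 0.3

# With 8 hosts reporting a process, the comparison value is 8 * 0.3 = 2.4.
print(threshold_exceeded(2, 8, MINOR_THRESHOLD))  # 2 down: below the threshold
print(threshold_exceeded(3, 8, MINOR_THRESHOLD))  # 3 down: above the threshold
```

The comparison is strict, so the share of down processes must exceed the threshold rather than merely reach it.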

Troubleshooting

  • Verify the containerd status on the affected node using systemctl status containerd.

  • Verify the Docker status on the affected node using systemctl status docker.

  • For issues with cmp, verify criproxy using systemctl status criproxy.

  • Inspect the logs in /var/log/kubernetes.log.

Tuning

Not required

KubernetesProcessDownMajor

Severity

Major

Summary

{{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_major_threshold_percent * 100}}%) are down for 2 minutes.

Raise condition

count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_major_threshold_percent }}

Description

Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on more than 60% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
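As a worked reading of the raise condition, assume that instance_major_threshold_percent is 0.6, as the 60% in the description suggests. Let N be the number of hosts that report a given process and D the number of those hosts where it is down. Both symbols are introduced here for illustration only.

```latex
\[
D > 0.6\,N \quad\Longleftrightarrow\quad D \ge \lfloor 0.6\,N \rfloor + 1
\]
```

For example, with N = 5 the alert needs D ≥ 4, and with N = 3 it needs D ≥ 2.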

Troubleshooting

  • Verify the containerd status on the affected node using systemctl status containerd.

  • Verify the Docker status on the affected node using systemctl status docker.

  • For issues with cmp, verify criproxy using systemctl status criproxy.

  • Inspect the logs in /var/log/kubernetes.log.

Tuning

Not required

KubernetesProcessOutage

Severity

Critical

Summary

All Kubernetes {{ $labels.process_name }} processes are down for 2 minutes.

Raise condition

count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) == count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name)

Description

Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on all ctl and cmp hosts, that is, when every monitored instance of that process is down. The process_name label in the raised alert contains the process name.
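The outage condition compares two counts per process_name: all series and the series with value 0. The following is a rough sketch of that comparison, not the Prometheus implementation. The outage_processes helper and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def outage_processes(samples: Dict[Tuple[str, str], int]) -> List[str]:
    """Return process names for which every reporting host shows 0."""
    total: Counter = Counter()
    down: Counter = Counter()
    for (_host, process_name), running in samples.items():
        total[process_name] += 1
        if running == 0:
            down[process_name] += 1
    # Only names with at least one down series can match, as in PromQL,
    # where count() over an empty result yields no series.
    return sorted(name for name in down if down[name] == total[name])


# Invented snapshot for illustration only.
samples = {
    ("ctl01", "hyperkube-kubelet"): 0,
    ("cmp01", "hyperkube-kubelet"): 0,
    ("cmp02", "hyperkube-kubelet"): 0,
    ("ctl01", "hyperkube-proxy"): 0,
    ("cmp01", "hyperkube-proxy"): 0,
    ("cmp02", "hyperkube-proxy"): 1,
}

print(outage_processes(samples))
```

In this snapshot only hyperkube-kubelet qualifies, because one hyperkube-proxy instance is still running.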

Troubleshooting

  • Verify the containerd status on the affected node using systemctl status containerd.

  • Verify the Docker status on the affected node using systemctl status docker.

  • For issues with cmp, verify criproxy using systemctl status criproxy.

  • Inspect the logs in /var/log/kubernetes.log.

Tuning

Not required