Kubernetes

This section describes the alerts for Kubernetes.


ContainerScrapeError

Severity Warning
Summary Prometheus was not able to scrape metrics from the container on the {{ $labels.instance }} Kubernetes instance.
Raise condition container_scrape_error != 0
Description Raises when cAdvisor fails to scrape metrics from a container.
Tuning Not required

KubernetesProcessDown

Severity Minor
Summary The Kubernetes {{ $labels.process_name }} process on the {{ $labels.host }} node is down for 2 minutes.
Raise condition procstat_running{process_name=~"hyperkube-.*"} == 0
Description Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on any ctl host, or a running hyperkube-kubelet or hyperkube-proxy process on any cmp host. The process_name label in the raised alert contains the process name.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues with cmp, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required

KubernetesProcessDownMinor

Severity Minor
Summary {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_minor_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_minor_threshold_percent }}
Description Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on more than 30% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues with cmp, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required
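
The raise condition above compares, per process_name, the number of hosts where the process is down against a fraction of all hosts that report it. The following is a minimal Python sketch of that comparison, not the StackLight implementation. The function name, the tuple layout, and the sample hosts are hypothetical; the thresholds 0.3 and 0.6 are assumed from the "more than 30%" and "more than 60%" wording in the Minor and Major descriptions.

```python
from collections import Counter


def processes_over_threshold(samples, threshold):
    """Return {process_name: down_count} for every process whose down
    count exceeds threshold * total count, mirroring the PromQL
    count(... == 0) by (process_name) > count(...) by (process_name) * threshold.

    samples: iterable of (process_name, host, running) tuples, where
    running is the procstat_running value (0 means not running).
    """
    total = Counter()
    down = Counter()
    for process_name, _host, running in samples:
        total[process_name] += 1
        if running == 0:
            down[process_name] += 1
    # Groups with no down hosts never appear, as in the PromQL expression.
    return {name: n for name, n in down.items() if n > total[name] * threshold}


# Hypothetical sample data: 2 of 4 kubelets and 1 of 4 proxies are down.
samples = [
    ("hyperkube-kubelet", "ctl01", 1),
    ("hyperkube-kubelet", "ctl02", 0),
    ("hyperkube-kubelet", "cmp01", 0),
    ("hyperkube-kubelet", "cmp02", 1),
    ("hyperkube-proxy", "ctl01", 1),
    ("hyperkube-proxy", "ctl02", 1),
    ("hyperkube-proxy", "cmp01", 0),
    ("hyperkube-proxy", "cmp02", 1),
]

minor = processes_over_threshold(samples, 0.3)  # assumed Minor threshold
major = processes_over_threshold(samples, 0.6)  # assumed Major threshold
print(minor, major)
```

With this data, hyperkube-kubelet crosses the 30% threshold (2 > 4 * 0.3) but not the 60% threshold (2 < 4 * 0.6), and hyperkube-proxy crosses neither (1 < 4 * 0.3).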

KubernetesProcessDownMajor

Severity Major
Summary {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_major_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_major_threshold_percent }}
Description Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on more than 60% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues with cmp, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required

KubernetesProcessOutage

Severity Critical
Summary All Kubernetes {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) == count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name)
Description Raises when Telegraf cannot find a running instance of one of the hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on any of the ctl and cmp hosts. The process_name label in the raised alert contains the process name.
Troubleshooting
  • Verify the containerd status on the affected node using systemctl status containerd.
  • Verify the Docker status on the affected node using systemctl status docker.
  • For issues with cmp, verify criproxy using systemctl status criproxy.
  • Inspect the logs in /var/log/kubernetes.log.
Tuning Not required
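
The outage condition above fires for a process_name only when the count of all reporting hosts equals the count of hosts where the process is down. A minimal Python sketch of that equality check follows. It is an illustration under assumptions: the function name, the tuple layout, and the host names are hypothetical and do not come from the alert definition.

```python
def outage_processes(samples):
    """Return the sorted process names that are down on every host that
    reports them, mirroring
    count(...) by (process_name) == count(... == 0) by (process_name).

    samples: iterable of (process_name, host, running) tuples, where
    running is the procstat_running value (0 means not running).
    """
    total = {}
    down = {}
    for name, _host, running in samples:
        total[name] = total.get(name, 0) + 1
        if running == 0:
            down[name] = down.get(name, 0) + 1
    # Only names with at least one down host can match the equality.
    return sorted(name for name in down if down[name] == total[name])


# Hypothetical sample data: the API server is down on all three ctl hosts,
# while the scheduler is down on only one of them.
samples = [
    ("hyperkube-apiserver", "ctl01", 0),
    ("hyperkube-apiserver", "ctl02", 0),
    ("hyperkube-apiserver", "ctl03", 0),
    ("hyperkube-scheduler", "ctl01", 0),
    ("hyperkube-scheduler", "ctl02", 1),
    ("hyperkube-scheduler", "ctl03", 1),
]

outage = outage_processes(samples)
print(outage)
```

In this example only hyperkube-apiserver qualifies as an outage; the partially failed hyperkube-scheduler would instead be covered by the threshold-based alerts.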