Kubernetes
This section describes the alerts for Kubernetes.
ContainerScrapeError
| Severity | Warning |
| Summary | Prometheus was not able to scrape metrics from the container on the {{ $labels.instance }} Kubernetes instance. |
| Raise condition | container_scrape_error != 0 |
| Description | Raises when cAdvisor fails to scrape metrics from a container. See the query example after this table. |
| Tuning | Not required |
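To see which containers are currently affected, you can run the raise-condition query against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 (the address is an assumption; adjust it for your deployment):

```bash
# Hypothetical Prometheus address; replace with your server's URL.
PROM_URL=http://prometheus:9090

# List every series where cAdvisor reported a scrape error; the
# instance and container labels identify the affected containers.
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=container_scrape_error != 0'
```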
KubernetesProcessDown
| Severity | Minor |
| Summary | The Kubernetes {{ $labels.process_name }} process on the {{ $labels.host }} node has been down for 2 minutes. |
| Raise condition | procstat_running{process_name=~"hyperkube-.*"} == 0 |
| Description | Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on any ctl host, or hyperkube-kubelet or hyperkube-proxy processes on any cmp host. The process_name label in the raised alert contains the process name. |
| Troubleshooting | Verify the containerd status on the affected node using systemctl status containerd. Verify the Docker status on the affected node using systemctl status docker. For issues on the cmp nodes, verify the criproxy status using systemctl status criproxy. Inspect the logs in /var/log/kubernetes.log. See the sketch after this table. |
| Tuning | Not required |
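A minimal sketch of the troubleshooting steps above, run on the node from the alert's host label (the service names are taken from the steps; the criproxy check applies only to cmp nodes):

```bash
#!/usr/bin/env bash
# Run on the affected node reported in the alert's host label.

# Container runtimes that back the hyperkube processes.
systemctl status containerd
systemctl status docker

# On cmp nodes only: criproxy mediates between kubelet and the runtimes.
systemctl status criproxy

# Recent log entries often show why a hyperkube process exited.
tail -n 100 /var/log/kubernetes.log
```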
KubernetesProcessDownMinor
| Severity | Minor |
| Summary | {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_minor_threshold_percent * 100 }}%) have been down for 2 minutes. |
| Raise condition | count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_minor_threshold_percent }} |
| Description | Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on more than 30% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts. See the worked example after this table. |
| Troubleshooting | Verify the containerd status on the affected nodes using systemctl status containerd. Verify the Docker status on the affected nodes using systemctl status docker. For issues on the cmp nodes, verify the criproxy status using systemctl status criproxy. Inspect the logs in /var/log/kubernetes.log. |
| Tuning | Not required |
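As a worked example of the raise condition: with instance_minor_threshold_percent set to 0.3 and five hosts running hyperkube-apiserver, the alert fires once more than 5 * 0.3 = 1.5 processes, that is, two or more, are down. To see how close each process is to the threshold, you can query the down ratio per process name. A sketch, again assuming a hypothetical Prometheus at http://prometheus:9090:

```bash
PROM_URL=http://prometheus:9090  # assumption; adjust for your deployment

# Fraction of hyperkube processes that are down, per process name.
# Values above 0.3 correspond to this alert's minor threshold.
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) / count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name)'
```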
KubernetesProcessDownMajor
| Severity | Major |
| Summary | {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_major_threshold_percent * 100 }}%) have been down for 2 minutes. |
| Raise condition | count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_major_threshold_percent }} |
| Description | Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on more than 60% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts, or the host-check sketch after this table. |
| Troubleshooting | Verify the containerd status on the affected nodes using systemctl status containerd. Verify the Docker status on the affected nodes using systemctl status docker. For issues on the cmp nodes, verify the criproxy status using systemctl status criproxy. Inspect the logs in /var/log/kubernetes.log. |
| Tuning | Not required |
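Besides the host label in the KubernetesProcessDown alerts, you can identify the affected nodes by checking for the processes directly. A minimal sketch, assuming SSH access; the node names are placeholders for your inventory:

```bash
# Placeholder host names; substitute the ctl and cmp nodes of your cluster.
for node in ctl01 ctl02 ctl03; do
  echo "== ${node} =="
  # List running hyperkube processes by full command line.
  ssh "${node}" 'pgrep -a -f hyperkube- || echo "no hyperkube processes running"'
done
```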
KubernetesProcessOutage
| Severity | Critical |
| Summary | All Kubernetes {{ $labels.process_name }} processes have been down for 2 minutes. |
| Raise condition | count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) == count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) |
| Description | Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on all ctl and cmp hosts, that is, when the number of instances reporting a process equals the number reporting it as not running. The process_name label in the raised alert contains the process name. See the availability check after this table. |
| Troubleshooting | Verify the containerd status on the affected nodes using systemctl status containerd. Verify the Docker status on the affected nodes using systemctl status docker. For issues on the cmp nodes, verify the criproxy status using systemctl status criproxy. Inspect the logs in /var/log/kubernetes.log. |
| Tuning | Not required |
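Because this condition fires only when every instance of a process is down, a full hyperkube-apiserver outage also makes the cluster API unreachable. A quick availability check, assuming a hypothetical API endpoint (replace the URL with your cluster's API server address):

```bash
# Hypothetical API server URL; use your cluster's address and port.
curl -ks --max-time 5 https://ctl-vip:443/healthz \
  || echo "Kubernetes API server is unreachable"
```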