Kubernetes
This section describes the alerts for Kubernetes.
ContainerScrapeError

Severity: Warning
Summary: Prometheus was not able to scrape metrics from the container on the {{ $labels.instance }} Kubernetes instance.
Raise condition: container_scrape_error != 0
Description: Raises when cAdvisor fails to scrape metrics from a container.
Tuning: Not required
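To see which instances are currently matched by the raise condition, you can run the same expression as an instant query against the Prometheus HTTP API. The sketch below is not part of the alert definition; the endpoint URL is an assumption (adjust to your deployment), and the parsing is shown against a canned response in the shape Prometheus returns.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus endpoint; adjust to your deployment.
PROM_URL = "http://localhost:9090/api/v1/query"

def scrape_error_query_url(base=PROM_URL):
    """Build the instant-query URL for the alert's raise condition."""
    return base + "?" + urlencode({"query": "container_scrape_error != 0"})

def affected_instances(response_body):
    """Extract the Kubernetes instances cAdvisor failed to scrape
    from a decoded Prometheus instant-query response."""
    return sorted(
        sample["metric"].get("instance", "<unknown>")
        for sample in response_body["data"]["result"]
    )

# Canned response in the shape Prometheus returns for an instant query.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"__name__": "container_scrape_error", "instance": "cmp01"},
     "value": [1700000000, "1"]},
]}}
print(affected_instances(sample))  # -> ['cmp01']
```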
KubernetesProcessDown

Severity: Minor
Summary: The Kubernetes {{ $labels.process_name }} process on the {{ $labels.host }} node is down for 2 minutes.
Raise condition: procstat_running{process_name=~"hyperkube-.*"} == 0
Description: Raises when Telegraf cannot find a running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler process on any ctl host, or a hyperkube-kubelet or hyperkube-proxy process on any cmp host. The process_name label in the raised alert contains the process name.
Troubleshooting:
- Verify the containerd status on the affected node using systemctl status containerd.
- Verify the Docker status on the affected node using systemctl status docker.
- For issues on cmp nodes, verify the criproxy status using systemctl status criproxy.
- Inspect the logs in /var/log/kubernetes.log.
Tuning: Not required
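The raise condition selects every procstat_running series whose process_name matches the hyperkube-.* regex and whose value is 0. A minimal Python sketch of that selection, with sample metric values assumed for illustration:

```python
import re

# Sample procstat_running values as (host, process_name) -> 0/1,
# mimicking what Telegraf's procstat input reports (assumed data).
procstat_running = {
    ("ctl01", "hyperkube-apiserver"): 1,
    ("ctl01", "hyperkube-scheduler"): 0,
    ("cmp01", "hyperkube-kubelet"): 1,
    ("cmp01", "docker"): 0,  # not matched by the hyperkube-.* regex
}

def down_hyperkube_processes(metrics, pattern=r"hyperkube-.*"):
    """Return (host, process_name) pairs matching the raise condition
    procstat_running{process_name=~"hyperkube-.*"} == 0."""
    rx = re.compile(pattern)
    return sorted(
        (host, proc) for (host, proc), running in metrics.items()
        if rx.fullmatch(proc) and running == 0
    )

print(down_hyperkube_processes(procstat_running))
# -> [('ctl01', 'hyperkube-scheduler')]
```

Note that docker is down in the sample data but does not fire this alert, because only process names matching hyperkube-.* are evaluated.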
KubernetesProcessDownMinor

Severity: Minor
Summary: {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_minor_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition: count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_minor_threshold_percent }}
Description: Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on more than 30% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting:
- Verify the containerd status on the affected node using systemctl status containerd.
- Verify the Docker status on the affected node using systemctl status docker.
- For issues on cmp nodes, verify the criproxy status using systemctl status criproxy.
- Inspect the logs in /var/log/kubernetes.log.
Tuning: Not required
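The Minor and Major variants share one shape: per process_name, the count of hosts where the process is down must exceed the total host count times a threshold. A Python sketch of that arithmetic, assuming the documented thresholds of 0.3 (Minor) and 0.6 (Major) and invented sample data:

```python
def down_fraction_exceeds(metrics, threshold):
    """Per process name, report whether the number of hosts where the
    process is down exceeds threshold * total hosts reporting it --
    the shape of the KubernetesProcessDownMinor/Major raise conditions."""
    totals, down = {}, {}
    for (host, proc), running in metrics.items():
        totals[proc] = totals.get(proc, 0) + 1
        if running == 0:
            down[proc] = down.get(proc, 0) + 1
    return sorted(p for p, d in down.items() if d > totals[p] * threshold)

# Assumed sample: 3 of 5 kubelet instances down (60% of hosts).
metrics = {(f"cmp{i}", "hyperkube-kubelet"): int(i > 3) for i in range(1, 6)}
print(down_fraction_exceeds(metrics, 0.3))  # Minor fires: 3 > 5 * 0.3
print(down_fraction_exceeds(metrics, 0.6))  # Major quiet: 3 > 5 * 0.6 is false
```

With 3 of 5 instances down, only the 30% threshold is exceeded, so KubernetesProcessDownMinor fires while KubernetesProcessDownMajor stays silent; the strict > matches the PromQL comparison.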
KubernetesProcessDownMajor

Severity: Major
Summary: {{ $value }} Kubernetes {{ $labels.process_name }} processes (>= {{ instance_major_threshold_percent * 100}}%) are down for 2 minutes.
Raise condition: count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name) > count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) * {{ instance_major_threshold_percent }}
Description: Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on more than 60% of the ctl or cmp hosts. The process_name label in the raised alert contains the process name. For the affected nodes, see the host label in the KubernetesProcessDown alerts.
Troubleshooting:
- Verify the containerd status on the affected node using systemctl status containerd.
- Verify the Docker status on the affected node using systemctl status docker.
- For issues on cmp nodes, verify the criproxy status using systemctl status criproxy.
- Inspect the logs in /var/log/kubernetes.log.
Tuning: Not required
KubernetesProcessOutage

Severity: Critical
Summary: All Kubernetes {{ $labels.process_name }} processes are down for 2 minutes.
Raise condition: count(procstat_running{process_name=~"hyperkube-.*"}) by (process_name) == count(procstat_running{process_name=~"hyperkube-.*"} == 0) by (process_name)
Description: Raises when Telegraf cannot find running hyperkube-kubelet, hyperkube-proxy, hyperkube-apiserver, hyperkube-controller-manager, or hyperkube-scheduler processes on all ctl and cmp hosts. The process_name label in the raised alert contains the process name.
Troubleshooting:
- Verify the containerd status on the affected node using systemctl status containerd.
- Verify the Docker status on the affected node using systemctl status docker.
- For issues on cmp nodes, verify the criproxy status using systemctl status criproxy.
- Inspect the logs in /var/log/kubernetes.log.
Tuning: Not required
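Unlike the threshold alerts, the outage condition compares two counts for equality: the total number of instances reporting a process equals the number of instances where it is down. A Python sketch of that per-process check, using invented sample data:

```python
def full_outage_processes(metrics):
    """Process names for which every reported instance is down --
    the shape of the KubernetesProcessOutage raise condition
    count(total) == count(value == 0), grouped by process_name."""
    totals, down = {}, {}
    for (host, proc), running in metrics.items():
        totals[proc] = totals.get(proc, 0) + 1
        if running == 0:
            down[proc] = down.get(proc, 0) + 1
    return sorted(p for p in totals if down.get(p, 0) == totals[p])

# Assumed sample: scheduler down everywhere, apiserver down on one node.
metrics = {
    ("ctl01", "hyperkube-scheduler"): 0,
    ("ctl02", "hyperkube-scheduler"): 0,
    ("ctl01", "hyperkube-apiserver"): 1,
    ("ctl02", "hyperkube-apiserver"): 0,
}
print(full_outage_processes(metrics))  # -> ['hyperkube-scheduler']
```

Here hyperkube-apiserver is down on one of two nodes, so it would raise KubernetesProcessDown (and possibly a threshold alert) but not the Critical outage.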