Troubleshoot Kubernetes applications alerts¶
This section describes the investigation and troubleshooting steps for the Kubernetes applications alerts.
KubePodsCrashLooping¶
Related inhibited alert: KubePodsRegularLongTermRestarts.
Root cause | Termination of containers in Pods having the restartPolicy set to Always or OnFailure, which makes kubelet restart them repeatedly.
---|---
Investigation | Note: Verify whether more alerts are firing in the MOSK cluster to obtain more information on the cluster state and simplify investigation and mitigation. Also examine how the affected application relates to other applications (its dependencies) and to the Kubernetes resources it relies on. During investigation, the affected Pod will likely be in the CrashLoopBackOff state.
Mitigation | Fixes typically fall into one of several categories, depending on the root cause identified during the investigation.
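As a starting point, the following is a minimal kubectl sketch for inspecting a crash-looping Pod; the openstack namespace and the my-pod name are placeholders to replace with the values from the alert labels.

```bash
# Pod state, restart count, the reason for the last container termination
# (for example, OOMKilled or Error), and recent events.
kubectl -n openstack describe pod my-pod

# Logs of the previous (crashed) container instance usually contain the
# actual application error. Add -c <container> for multi-container Pods.
kubectl -n openstack logs my-pod --previous

# Cluster events related to the Pod, sorted by time.
kubectl -n openstack get events \
  --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp
```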
KubePodsNotReady¶
Removed in 17.0.0 and 16.0.0
Root cause | The Pod could not start successfully for the last 15 minutes, meaning that its status phase is one of the non-running phases, such as Pending or Unknown.
---|---
Investigation | Note: Verify whether more alerts are firing in the MOSK cluster to obtain more information on the cluster state and simplify investigation and mitigation. Also examine how the affected application relates to other applications (its dependencies) and to the Kubernetes resources it relies on.
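For reference, a minimal sketch for finding Pods stuck in a non-running phase and the reason behind it; the openstack namespace and the my-pod name are placeholders.

```bash
# List Pods that are neither Running nor Succeeded.
kubectl -n openstack get pods \
  --field-selector=status.phase!=Running,status.phase!=Succeeded

# For a Pending Pod, the Events section usually names the blocker:
# unschedulable (insufficient resources, taints), image pull errors,
# missing ConfigMaps or Secrets, or unbound PersistentVolumeClaims.
kubectl -n openstack describe pod my-pod
```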
KubePodsRegularLongTermRestarts¶
Related inhibiting alert: KubePodsCrashLooping.
Root cause | A long-term version of the KubePodsCrashLooping alert: containers of the affected Pod keep restarting regularly over a long period of time.
---|---
Investigation | While investigating, the affected Pod will likely be in the Running state, with the restart count of its containers growing over time.
Mitigation | Refer to the KubePodsCrashLooping Mitigation section. Fixing this issue may require more effort than simple application tuning. You may need to upgrade the application, upgrade its dependency libraries, or apply a fix in the application code.
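To see how regular the restarts actually are, the following sketch may help; the openstack namespace and the my-pod name are placeholders, and the Prometheus query assumes the standard kube-state-metrics metric is scraped by your monitoring stack.

```bash
# Sort Pods by the restart count of their first container.
kubectl -n openstack get pods \
  --sort-by='.status.containerStatuses[0].restartCount'

# Per-container restart count plus the reason and time of the last restart.
kubectl -n openstack get pod my-pod -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.restartCount}{" "}{.lastState.terminated.reason}{" at "}{.lastState.terminated.finishedAt}{"\n"}{end}'

# In Prometheus, the restart history over time can be plotted with:
#   increase(kube_pod_container_status_restarts_total{pod="my-pod"}[1d])
```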
KubeDeploymentGenerationMismatch¶
Root cause | Deployment generation, or the metadata.generation field, is incremented every time the Deployment specification changes, while the status.observedGeneration field records the generation last processed by the Deployment controller. When the Deployment controller fails to observe a new Deployment version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
---|---
Investigation and mitigation | The alert indicates failure of the Kubernetes built-in Deployment controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.
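A quick way to confirm the mismatch is to compare the two fields directly; a sketch below, with my-deployment and the openstack namespace as placeholders, assuming a control plane where kube-controller-manager runs as Pods labeled component=kube-controller-manager (adjust to your distribution). The same check applies to KubeStatefulSetGenerationMismatch with the statefulset resource type.

```bash
# Desired versus observed generation of the Deployment.
kubectl -n openstack get deployment my-deployment \
  -o jsonpath='generation={.metadata.generation} observed={.status.observedGeneration}{"\n"}'

# If they differ for a long time, inspect the controller that reconciles
# Deployments: kube-controller-manager on the control plane nodes.
kubectl -n kube-system get pods -l component=kube-controller-manager
kubectl -n kube-system logs -l component=kube-controller-manager --tail=100
```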
KubeDeploymentReplicasMismatch¶
Root cause | The number of available Deployment replicas did not match the desired state set in the spec.replicas field of the Deployment.
---|---
Investigation and mitigation | Refer to KubePodsCrashLooping.
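To see which replicas are missing and why, a minimal sketch; my-deployment, the openstack namespace, and the app=my-app selector are placeholders (take the real selector from spec.selector of the Deployment).

```bash
# Desired, up-to-date, and available replica counts.
kubectl -n openstack get deployment my-deployment

# Conditions such as Progressing=False (ProgressDeadlineExceeded) or
# ReplicaFailure=True point at the reason for the mismatch.
kubectl -n openstack describe deployment my-deployment

# Follow the rollout and inspect the Pods behind the Deployment.
kubectl -n openstack rollout status deployment/my-deployment
kubectl -n openstack get pods -l app=my-app -o wide
```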
KubeDeploymentOutage¶
Related inhibited alert: KubeDeploymentReplicasMismatch.
Root cause | All Deployment replicas have been unavailable for the last 10 minutes, meaning that the application is likely down.
---|---
Investigation and mitigation | Refer to KubePodsCrashLooping.
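Because every replica is down, look for a cause common to all of them; a sketch with the same placeholder names as above.

```bash
# Confirm the outage and list the Pods behind the Deployment.
kubectl -n openstack get deployment my-deployment -o wide
kubectl -n openstack get pods -l app=my-app -o wide

# Recent warning events often reveal a shared cause: a bad image tag,
# a broken ConfigMap or Secret, or a node-level problem.
kubectl -n openstack get events --field-selector type=Warning \
  --sort-by=.lastTimestamp
```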
KubeStatefulSetReplicasMismatch¶
Root cause | The number of running StatefulSet replicas did not match the desired state set in the spec.replicas field of the StatefulSet.
---|---
Investigation and mitigation | Refer to KubePodsCrashLooping.
KubeStatefulSetGenerationMismatch¶
Root cause | StatefulSet generation, or the metadata.generation field, is incremented every time the StatefulSet specification changes, while the status.observedGeneration field records the generation last processed by the StatefulSet controller. When the StatefulSet controller fails to observe a new StatefulSet version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
---|---
Investigation and mitigation | The alert indicates failure of the Kubernetes built-in StatefulSet controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.
KubeStatefulSetOutage¶
Related inhibited alerts: KubeStatefulSetReplicasMismatch and KubeStatefulSetUpdateNotRolledOut.
Root cause | StatefulSet workloads are typically distributed across Kubernetes nodes. Therefore, losing more than one replica indicates either a serious application failure or issues on the Kubernetes cluster level. The application likely experiences severe performance degradation and availability issues.
---|---
Investigation and mitigation | Refer to KubePodsCrashLooping. If, after fixing the root cause on the Pod level, the affected Pods are still non-functional, recreate them so that the StatefulSet controller restores the desired state.
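If the Pods remain broken after the Pod-level fix, one common recovery step is letting the StatefulSet controller recreate them; a sketch below, with my-statefulset, the openstack namespace, and the app=my-app selector as placeholders.

```bash
# Check which replicas are missing or not ready.
kubectl -n openstack get statefulset my-statefulset
kubectl -n openstack get pods -l app=my-app -o wide

# Deleting a broken Pod makes the StatefulSet controller recreate it with
# the same identity (name and PVCs). Do this one replica at a time.
kubectl -n openstack delete pod my-statefulset-0
```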
KubeStatefulSetUpdateNotRolledOut¶
Root cause | The StatefulSet update did not finish in 30 minutes, which was detected as a mismatch between the update and current revisions of the StatefulSet (the status.updateRevision and status.currentRevision fields).
---|---
Investigation and mitigation | Refer to KubePodsCrashLooping. If, after fixing the root cause on the Pod level, the affected Pods are still non-functional, recreate them so that the StatefulSet controller restores the desired state.
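To see where the rollout stopped, the revision fields and the rollout status help; my-statefulset and the openstack namespace are placeholders.

```bash
# Follow the stuck update and compare the current and update revisions.
kubectl -n openstack rollout status statefulset/my-statefulset
kubectl -n openstack get statefulset my-statefulset \
  -o jsonpath='current={.status.currentRevision} update={.status.updateRevision} updated={.status.updatedReplicas}{"\n"}'

# With the default RollingUpdate strategy, the update proceeds one Pod at
# a time in reverse ordinal order, so the first not-yet-updated Pod and
# the StatefulSet events usually show why the rollout stopped.
kubectl -n openstack describe statefulset my-statefulset
```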
KubeDaemonSetRolloutStuck¶
Related inhibiting alert: KubeDaemonSetOutage.
Root cause | For the last 30 minutes, the DaemonSet has had at least one Pod (not necessarily the same one) that is not ready after being correctly scheduled. It may be caused by missing Pod requirements on the node or unexpected Pod termination.
---|---
Investigation and mitigation | See KubePodsCrashLooping.
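A sketch for locating the not-ready DaemonSet Pod; my-daemonset, my-pod, the kube-system namespace, and the app=my-daemonset selector are placeholders.

```bash
# Desired, current, ready, and up-to-date Pod counts of the DaemonSet.
kubectl -n kube-system get daemonset my-daemonset
kubectl -n kube-system rollout status daemonset/my-daemonset

# List the DaemonSet Pods with their nodes, spot the not-ready one, and
# describe it to see why it does not become ready.
kubectl -n kube-system get pods -l app=my-daemonset -o wide
kubectl -n kube-system describe pod my-pod
```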
KubeDaemonSetNotScheduled¶
Can relate to: KubeCPUOvercommitPods, KubeMemOvercommitPods.
Root cause | At least one Pod of the DaemonSet was not scheduled to a target node. This may happen if resource requests for the Pod cannot be satisfied by the node or if the node lacks other resources that the Pod requires, such as a PV of a specific storage class.
---|---
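If KubeCPUOvercommitPods or KubeMemOvercommitPods fire at the same time, the node simply cannot fit the Pod requests. Below is a sketch for checking the scheduling gap; my-daemonset, my-node, and the kube-system namespace are placeholders.

```bash
# How many Pods the DaemonSet wants versus how many are scheduled.
kubectl -n kube-system get daemonset my-daemonset \
  -o jsonpath='desired={.status.desiredNumberScheduled} current={.status.currentNumberScheduled}{"\n"}'

# On the node that misses the Pod, compare allocatable resources with the
# Pod requests and look for scheduling-related warning events.
kubectl describe node my-node
kubectl -n kube-system get events --field-selector type=Warning \
  --sort-by=.lastTimestamp
```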
KubeDaemonSetMisScheduled¶
Removed in MCC 2.27.0 (17.2.0 and 16.2.0)
Root cause | At least one node where the DaemonSet Pods were deployed got a taint that the Pods do not tolerate or otherwise stopped matching the DaemonSet scheduling constraints, so the Pods keep running on nodes where they should no longer be scheduled.
---|---
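A sketch for comparing node taints with the DaemonSet tolerations; my-daemonset and the kube-system namespace are placeholders.

```bash
# Number of DaemonSet Pods running on nodes where they should not run.
kubectl -n kube-system get daemonset my-daemonset \
  -o jsonpath='misscheduled={.status.numberMisscheduled}{"\n"}'

# Compare node taints with the tolerations of the DaemonSet Pod template.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
kubectl -n kube-system get daemonset my-daemonset \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
```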
KubeDaemonSetOutage¶
Related inhibiting alert: KubeDaemonSetRolloutStuck.
Root cause | Although the DaemonSet was not scaled down to zero, there are zero healthy Pods. As each DaemonSet Pod is deployed to a separate Kubernetes node, such a situation is rare and typically caused by a broken configuration (ConfigMaps or Secrets) or wrongly tuned resource limits.
---|---
Investigation and mitigation | See KubePodsCrashLooping.
KubeCronJobRunning¶
Related alert: ClockSkewDetected.
Root cause | A CronJob Pod fails to start within 15 minutes from the configured schedule, for example, because the Job Pod cannot be scheduled or started, or because of clock skew on the cluster nodes (see the related ClockSkewDetected alert).
---|---
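A sketch for checking why the CronJob did not produce a running Pod; my-cronjob and the openstack namespace are placeholders.

```bash
# Schedule, suspend flag, last schedule time, and active Jobs.
kubectl -n openstack get cronjob my-cronjob

# A missing recent Job points at the CronJob itself (schedule,
# startingDeadlineSeconds, concurrencyPolicy, suspend), while a Pending
# Job Pod points at scheduling or resource problems.
kubectl -n openstack get jobs --sort-by=.metadata.creationTimestamp
kubectl -n openstack describe cronjob my-cronjob

# Scheduling is time-based: if ClockSkewDetected is also firing, verify
# the node clocks first.
```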
KubeJobFailed¶
Related inhibited alert: KubePodsNotReady.
Root cause | At least one container of a Pod started by the Job exited with a non-zero status or was terminated by the Kubernetes or Linux system.
---|---
Investigation | See KubePodsCrashLooping.
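A sketch for tracing the failed Job down to the container error; my-job and the openstack namespace are placeholders. The job-name label is set on the Pods automatically by the Job controller.

```bash
# Job status and the failure condition, for example, BackoffLimitExceeded
# or DeadlineExceeded.
kubectl -n openstack describe job my-job

# Pods created by the Job and the logs of the failed containers.
kubectl -n openstack get pods -l job-name=my-job
kubectl -n openstack logs -l job-name=my-job --tail=100
```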