Troubleshoot Kubernetes applications alerts¶
This section describes the investigation and troubleshooting steps for the Kubernetes applications alerts.
KubePodsCrashLooping¶
Related inhibited alert: KubePodsRegularLongTermRestarts.

Root cause | Termination of containers in Pods that Kubernetes keeps restarting according to the Pod restart policy, which places the affected Pod into the `CrashLoopBackOff` state. |
---|---|
Investigation | Note: Verify if there are more alerts firing in the Container Cloud cluster to obtain more information on the cluster state and simplify issue investigation and mitigation. Also examine the relation of the affected application with other applications (dependencies) and the Kubernetes resources it relies on. During the investigation, the affected Pod will likely be in the `CrashLoopBackOff` state. Inspect the Pod events and the logs of the previously terminated containers, as shown in the sketch after this table. |
Mitigation | Fixes typically fall into one of the following categories: correcting the application configuration (for example, ConfigMaps, Secrets, or command-line arguments), adjusting resource requests and limits, or fixing the application code or the dependencies it relies on. |
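A minimal investigation sketch, assuming kubectl access to the affected cluster; `<pod-name>` and `<namespace>` are placeholders for the values reported in the alert labels:

```bash
# Check the Pod state and the container restart counters.
kubectl get pod <pod-name> -n <namespace> -o wide

# Review recent events for the Pod: failing probes, OOM kills, image pull errors.
kubectl describe pod <pod-name> -n <namespace>

# Read the logs of the previously terminated container to find the crash reason.
kubectl logs <pod-name> -n <namespace> --previous --tail=100
```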
KubePodsNotReady¶
Root cause | The Pod could not start successfully for the last 15 minutes, meaning that its status phase is one of the following: `Pending`, `Unknown`, or `Failed`. |
---|---|
Investigation | Note: Verify if there are more alerts firing in the Container Cloud cluster to obtain more information on the cluster state and simplify issue investigation and mitigation. Also examine the relation of the affected application with other applications (dependencies) and the Kubernetes resources it relies on. Inspect the Pod phase, conditions, and events to determine why the Pod cannot reach the `Running` phase, as shown in the sketch after this table. |
Mitigation | Address the condition reported in the Pod status and events, for example, resolve scheduling constraints or resource shortages for `Pending` Pods, fix image references and pull secrets, or fix failing containers as described in KubePodsCrashLooping. |
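A quick way to see why the Pod is not ready, assuming kubectl access; `<pod-name>` and `<namespace>` are placeholders:

```bash
# Show the Pod phase and its readiness conditions.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.phase}{"\n"}{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Review the events related to the Pod, newest last.
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
```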
KubePodsRegularLongTermRestarts¶
Related inhibiting alert: KubePodsCrashLooping.

Root cause | A long-term version of the KubePodsCrashLooping alert: containers of the Pod restart regularly over an extended period of time, which indicates a recurring issue in the application or its environment. |
---|---|
Investigation | While investigating, the affected Pod will likely be in the `Running` state because the restarts occur only from time to time. Inspect the container restart counters and the reason for the last termination, as shown in the sketch after this table, and refer to the KubePodsCrashLooping Investigation section. |
Mitigation | Refer to the KubePodsCrashLooping Mitigation section. Fixing this issue may require more effort than simple application tuning. You may need to upgrade the application, upgrade its dependency libraries, or apply a fix in the application code. |
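A short sketch for checking how often and why the containers restart; `<pod-name>` and `<namespace>` are placeholders:

```bash
# Print the restart count and the last termination reason of each container.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restarts="}{.restartCount}{" lastReason="}{.lastState.terminated.reason}{"\n"}{end}'
```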
KubeDeploymentGenerationMismatch¶
Root cause | Deployment generation, or the `metadata.generation` field, represents the desired revision of the Deployment, while `status.observedGeneration` represents the revision last observed by the Deployment controller. When the Deployment controller fails to observe a new Deployment version, these two fields differ. The mismatch lasting for more than 15 minutes triggers the alert. |
---|---|
Investigation and mitigation | The alert indicates failure of the Kubernetes built-in Deployment controller and requires debugging on the control plane level. See Troubleshooting for details on collecting the cluster state and mitigating known issues. A quick check of the affected fields is sketched after this table. |
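A minimal sketch for confirming the mismatch, assuming kubectl access; `<deployment-name>` and `<namespace>` are placeholders:

```bash
# Compare the desired and observed generation of the Deployment.
kubectl get deployment <deployment-name> -n <namespace> \
  -o custom-columns=GENERATION:.metadata.generation,OBSERVED:.status.observedGeneration

# Review the Deployment conditions reported by the controller.
kubectl get deployment <deployment-name> -n <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
```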
KubeDeploymentReplicasMismatch¶
Root cause | The number of running Deployment replicas did not match the desired state set in the `spec.replicas` field of the Deployment. |
---|---|
Investigation and mitigation | Refer to KubePodsCrashLooping and KubePodsNotReady. A quick replica status check is sketched after this table. |
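A quick sketch for comparing the desired and available replicas; `<deployment-name>`, `<namespace>`, and `<deployment-selector>` (the label selector of the Deployment Pods) are placeholders:

```bash
# Compare the desired, up-to-date, and available replica counts.
kubectl get deployment <deployment-name> -n <namespace>

# Watch the rollout progress and list the Pods that back the Deployment.
kubectl rollout status deployment/<deployment-name> -n <namespace> --timeout=60s
kubectl get pods -n <namespace> -l <deployment-selector> -o wide
```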
KubeDeploymentOutage¶
Related inhibited alert: KubeDeploymentReplicasMismatch.

Root cause | The number of unavailable Deployment replicas exceeded the configured maximum, meaning that the application experiences severe performance degradation and is likely not available. |
---|---|
Investigation | Inspect the Deployment status and the Pods that back it to determine why the replicas are unavailable, as shown in the sketch after this table, then follow the KubePodsCrashLooping and KubePodsNotReady investigation steps. |
Mitigation | Refer to KubePodsCrashLooping and KubePodsNotReady. |
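A short sketch for locating the unavailable replicas; `<deployment-name>`, `<namespace>`, and `<deployment-selector>` are placeholders:

```bash
# Show the Deployment status, including the number of unavailable replicas.
kubectl describe deployment <deployment-name> -n <namespace>

# List the Deployment Pods that are not in the Running phase.
kubectl get pods -n <namespace> -l <deployment-selector> \
  --field-selector=status.phase!=Running
```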
KubeStatefulSetReplicasMismatch¶
Root cause | The number of running StatefulSet replicas did not match the desired state set in the `spec.replicas` field of the StatefulSet. |
---|---|
Investigation and mitigation | Refer to KubePodsCrashLooping and KubePodsNotReady. A quick replica status check is sketched after this table. |
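A quick sketch for comparing the StatefulSet replica counters; `<statefulset-name>`, `<namespace>`, and `<statefulset-selector>` are placeholders:

```bash
# Compare the desired, current, and ready replica counts of the StatefulSet.
kubectl get statefulset <statefulset-name> -n <namespace> \
  -o custom-columns=DESIRED:.spec.replicas,CURRENT:.status.currentReplicas,READY:.status.readyReplicas

# List the StatefulSet Pods; their names are suffixed with an ordinal index.
kubectl get pods -n <namespace> -l <statefulset-selector> -o wide
```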
KubeStatefulSetGenerationMismatch¶
Root cause | StatefulSet generation, or the `metadata.generation` field, represents the desired revision of the StatefulSet, while `status.observedGeneration` represents the revision last observed by the StatefulSet controller. When the StatefulSet controller fails to observe a new StatefulSet version, these two fields differ. The mismatch lasting for more than 15 minutes triggers the alert. |
---|---|
Investigation and mitigation | The alert indicates failure of the Kubernetes built-in StatefulSet controller and requires debugging on the control plane level. See Troubleshooting for details on collecting the cluster state and mitigating known issues. A quick check of the affected fields is sketched after this table. |
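A minimal sketch for confirming the mismatch; `<statefulset-name>` and `<namespace>` are placeholders:

```bash
# Compare the desired and observed generation of the StatefulSet.
kubectl get statefulset <statefulset-name> -n <namespace> \
  -o custom-columns=GENERATION:.metadata.generation,OBSERVED:.status.observedGeneration
```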
KubeStatefulSetOutage¶
Related inhibited alerts: KubeStatefulSetReplicasMismatch and KubeStatefulSetUpdateNotRolledOut.

Root cause | StatefulSet workloads are typically distributed across Kubernetes nodes. Therefore, losing more than one replica indicates either a serious application failure or issues on the Kubernetes cluster level. The application likely experiences severe performance degradation and availability issues. |
---|---|
Investigation | Inspect the StatefulSet status, its Pods, and the PersistentVolumeClaims they use to determine why the replicas are unavailable, as shown in the sketch after this table, then follow the KubePodsCrashLooping and KubePodsNotReady investigation steps. |
Mitigation | Refer to KubePodsCrashLooping and KubePodsNotReady. If the affected Pods are still not Running after you fix the root cause on the Pod level, contact Mirantis support. Treat StatefulSets with special caution as they store data and their internal state. |
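A short sketch for checking the StatefulSet, its Pods, and their volumes; `<statefulset-name>`, `<namespace>`, and `<statefulset-selector>` are placeholders:

```bash
# Check how many StatefulSet replicas are ready.
kubectl get statefulset <statefulset-name> -n <namespace>

# List the StatefulSet Pods and the PersistentVolumeClaims in the namespace.
kubectl get pods -n <namespace> -l <statefulset-selector> -o wide
kubectl get pvc -n <namespace>
```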
KubeStatefulSetUpdateNotRolledOut¶
Root cause | The StatefulSet update did not finish in 15 minutes, which was detected as a mismatch between the `status.currentRevision` and `status.updateRevision` fields of the StatefulSet: the number of replicas running the new revision did not reach the desired number of replicas. |
---|---|
Investigation | Compare the current and update revisions of the StatefulSet and identify the Pods that still run the old revision, as shown in the sketch after this table. |
Mitigation | Refer to KubePodsCrashLooping and KubePodsNotReady. If the affected Pods are still not Running after you fix the root cause on the Pod level, contact Mirantis support. Treat StatefulSets with special caution as they store data and their internal state. Improper handling may result in a broken application cluster state and data loss. |
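A minimal sketch for checking the rollout progress; `<statefulset-name>`, `<namespace>`, and `<statefulset-selector>` are placeholders:

```bash
# Compare the current and update revisions and the number of updated replicas.
kubectl get statefulset <statefulset-name> -n <namespace> \
  -o custom-columns=CURRENT:.status.currentRevision,UPDATE:.status.updateRevision,UPDATED:.status.updatedReplicas,DESIRED:.spec.replicas

# Show which revision each Pod runs to find the Pods that block the update.
kubectl get pods -n <namespace> -l <statefulset-selector> -L controller-revision-hash
```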
KubeDaemonSetRolloutStuck¶
Related inhibiting alert: KubeDaemonSetOutage.

Root cause | For the last 30 minutes, the DaemonSet has had at least one Pod (not necessarily the same one) that is not ready after being correctly scheduled. It may be caused by missing Pod requirements on the node or by unexpected Pod termination. |
---|---|
Investigation | Check the DaemonSet rollout status and identify the nodes on which the DaemonSet Pods are not ready, as shown in the sketch after this table, then inspect those Pods. |
Mitigation | See KubePodsCrashLooping and KubePodsNotReady. |
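A short sketch for finding the not-ready DaemonSet Pods; `<daemonset-name>`, `<namespace>`, and `<daemonset-selector>` are placeholders:

```bash
# Check the rollout progress and the per-node Pod counters of the DaemonSet.
kubectl rollout status daemonset/<daemonset-name> -n <namespace> --timeout=60s
kubectl get daemonset <daemonset-name> -n <namespace>

# List the DaemonSet Pods with their nodes; look for Pods that are not fully ready.
kubectl get pods -n <namespace> -l <daemonset-selector> -o wide
```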
KubeDaemonSetNotScheduled¶
Can relate to: KubeCPUOvercommitPods, KubeMemOvercommitPods.

Root cause | At least one Pod of the DaemonSet was not scheduled to a target node. This may happen if the node cannot satisfy the resource requests of the Pod or if the node lacks other resources that the Pod requires, such as a PV of a specific storage class. |
---|---|
Investigation | Compare the desired and scheduled Pod counters of the DaemonSet, review the scheduling events, and verify the allocatable resources of the target nodes, as shown in the sketch after this table. |
Mitigation | Depending on the root cause, free up or add resources on the target nodes, adjust the Pod resource requests, or provide the missing resources, such as the required PV, that the Pod depends on. |
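A minimal sketch for locating the scheduling gap; `<daemonset-name>`, `<namespace>`, and `<node-name>` are placeholders:

```bash
# Compare the desired and scheduled Pod counters of the DaemonSet.
kubectl get daemonset <daemonset-name> -n <namespace> \
  -o custom-columns=DESIRED:.status.desiredNumberScheduled,CURRENT:.status.currentNumberScheduled,READY:.status.numberReady

# Review scheduling failures and the allocatable resources of a target node.
kubectl get events -n <namespace> --field-selector reason=FailedScheduling
kubectl describe node <node-name>
```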
KubeDaemonSetMisScheduled¶
Root cause | At least one node where the DaemonSet Pods were deployed got a taint that the DaemonSet does not tolerate or no longer matches the DaemonSet scheduling constraints, so the Pods running on it are reported as misscheduled. |
---|---|
Investigation | Check the misscheduled Pod counter of the DaemonSet and review the taints and labels of the nodes where its Pods run, as shown in the sketch after this table. |
Mitigation | Depending on the intent of the node change, either revert the taint or label change on the node, add the required toleration or node selector to the DaemonSet, or delete the misscheduled Pods from the affected nodes. |
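A short sketch for confirming the misscheduled Pods and inspecting node taints; `<daemonset-name>` and `<namespace>` are placeholders:

```bash
# Check how many DaemonSet Pods are reported as misscheduled.
kubectl get daemonset <daemonset-name> -n <namespace> \
  -o jsonpath='{.status.numberMisscheduled}{"\n"}'

# Review the taints of the cluster nodes.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```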
KubeDaemonSetOutage¶
Related inhibiting alert: KubeDaemonSetRolloutStuck.

Root cause | Although the DaemonSet was not scaled down to zero, there are zero healthy Pods. As each DaemonSet Pod is deployed to a separate Kubernetes node, such a situation is rare and typically caused by a broken configuration (ConfigMaps or Secrets) or wrongly tuned resource limits. |
---|---|
Investigation | Inspect the DaemonSet Pods, their events, and the ConfigMaps and Secrets they mount to find the failure reason common to all Pods, as shown in the sketch after this table. |
Mitigation | See KubePodsCrashLooping and KubePodsNotReady. |
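A short sketch for inspecting the failing DaemonSet Pods; `<daemonset-name>`, `<namespace>`, `<daemonset-selector>`, and `<pod-name>` are placeholders:

```bash
# Confirm that no DaemonSet Pods are ready and list them with their states.
kubectl get daemonset <daemonset-name> -n <namespace>
kubectl get pods -n <namespace> -l <daemonset-selector> -o wide

# Inspect one of the failing Pods, including the ConfigMaps and Secrets it mounts.
kubectl describe pod <pod-name> -n <namespace>
```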
KubeCronJobRunning¶
Related alert: ClockSkewDetected.

Root cause | A CronJob Pod fails to start within 15 minutes from the configured schedule. Possible root causes include clock skew on the cluster nodes, a previous Job run that has not finished yet, or cluster conditions that prevent the Job Pod from being created or scheduled. |
---|---|
Investigation | Check the last schedule time and the active Jobs of the CronJob, then inspect those Jobs and their Pods, as shown in the sketch after this table. Also verify whether the ClockSkewDetected alert is firing for the cluster nodes. |
Mitigation | Depending on the root cause, fix the time synchronization on the affected nodes, adjust the CronJob schedule or concurrency policy, or fix the issues that prevent the Job Pods from starting, as described in KubePodsNotReady. |
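A minimal sketch for checking the CronJob scheduling state; `<cronjob-name>`, `<job-name>`, and `<namespace>` are placeholders:

```bash
# Check when the CronJob was last scheduled and whether a Job is still active.
kubectl get cronjob <cronjob-name> -n <namespace> \
  -o jsonpath='{.status.lastScheduleTime}{"\n"}{.status.active}{"\n"}'

# List the Jobs created by the CronJob and the Pods of a specific Job.
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace> -l job-name=<job-name>
```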
KubeJobFailed¶
Related inhibited alert: KubePodsNotReady.

Root cause | At least one container of a Pod started by the Job exited with a non-zero status or was terminated by the Kubernetes or Linux system. A failing Pod triggers the KubePodsNotReady alert. |
---|---|
Investigation and mitigation | See KubePodsNotReady. The Job status and logs can be checked as sketched after this table. |
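A short sketch for finding the failure reason of the Job; `<job-name>` and `<namespace>` are placeholders:

```bash
# Check the Job status, conditions, and related events.
kubectl describe job <job-name> -n <namespace>

# Find the Pods created by the Job and read the logs of the failed containers.
kubectl get pods -n <namespace> -l job-name=<job-name>
kubectl logs job/<job-name> -n <namespace>
```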