Troubleshoot Kubernetes applications alerts

This section describes the investigation and troubleshooting steps for the Kubernetes applications alerts.


KubePodsCrashLooping

Related inhibited alert: KubePodsRegularLongTermRestarts.

Root cause

Termination of containers in Pods that have .spec.restartPolicy set to Always causes Kubernetes to bring them back. If the container exits again, kubelet exponentially increases the back-off delay between restarts until it reaches 5 minutes. Pods stuck in such a restart loop get the CrashLoopBackOff status. Because of the underlying metric inertia, StackLight measures restarts in an extended 20-minute time window.
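
For example, to list the Pods that are currently in the back-off loop together with their total container restart counts, you can filter on the container waiting reason. A minimal sketch, assuming jq is available and the namespace is a placeholder:

    kubectl get pods -n <pod_namespace> -o=json | jq -r \
    '.items[] | select(any(.status.containerStatuses[]?; .state.waiting.reason == "CrashLoopBackOff"))
    | "\(.metadata.name) restarts=\([.status.containerStatuses[].restartCount] | add)"'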

Investigation

Note

Verify if there are more alerts firing in the Container Cloud cluster to obtain more information on the cluster state and simplify issue investigation and mitigation.

Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on.

During investigation, the affected Pod will likely be in the CrashLoopBackOff or Error state.

  1. List the unhealthy Pods of a particular application. Use the label selector, if possible.

    kubectl get pods -n <pod_namespace> -l '<pod_app_label>=<pod_app_name>' \
    -o=json | jq -r \
    '.items[] | select(.status.phase != "Running") | .metadata.name, .status.phase'
    
  2. Collect logs from one of the unhealthy Pods and inspect them for errors and stack traces:

    kubectl logs -n <pod_namespace> <pod_name>
    
  3. Inspect Kubernetes events or the termination reason and exit code of the Pod (see also the command sketch at the end of this procedure):

    kubectl describe pods -n <pod_namespace> <pod_name>
    

    Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI.

  4. In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage.

    Important

    Performing the following step requires understanding of Kubernetes workloads.

  5. In some scenarios, observing Pods failing in real time may provide more insight into the issue. To investigate the application this way, restart (never with the --force flag) one of the failing Pods and inspect the following in the Kubernetes Pods Grafana dashboard, events, and logs:

    • Define whether the issue reproduces.

    • Verify when the issue reproduces during the Pod uptime: during initialization or after some time.

    • Verify that the application requirements for Kubernetes resources and external dependencies are satisfied.

    • Define whether there is an issue with passing the readiness or liveness tests.

    • Define how the Pod container terminates and whether it is OOMKilled.

    Note

    While investigating, monitor the application health and verify the resource limits. Most issues can be solved by fixing the dependent application or tuning, such as providing additional flags, changing resource limits, and so on.
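
    As a quick check for steps 3 and 5, the following sketch prints the last termination reason and exit code of every container in the affected Pod (a minimal example, assuming jq is available and the Pod name is a placeholder):

    kubectl get pod -n <pod_namespace> <pod_name> -o=json | jq -r \
    '.status.containerStatuses[]? | "\(.name) reason=\(.lastState.terminated.reason) exitCode=\(.lastState.terminated.exitCode)"'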

Mitigation

Fixes typically fall into one of the following categories:

  • Fix the dependent service. For example, fixing opensearch-master makes fluentd-logs Pods start successfully.

  • Fix the configuration if it causes container failure.

  • Tune the application by providing flags changing its behavior.

  • Tune the CPU or memory limits if the system terminates a container upon hitting the memory limit (OOMKilled) or if the container stops responding because of CPU throttling.

  • Fix code in case of application bugs.

KubePodsNotReady

Removed in 17.0.0, 16.0.0, and 14.1.0

Root cause

The Pod could not start successfully for the last 15 minutes, meaning that its status phase is one of the following:

  • Pending - at least one Pod container was not created. The Pod waits for the Kubernetes cluster to satisfy its requirements, for example, when the Docker image cannot be pulled or a persistent volume cannot be created.

  • Failed - the Pod terminated in the Error state and was not restarted. At least one container exited with a non-zero status code or was terminated by the system, for example, OOMKilled.

  • Unknown - kubelet communication issues.
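
For a Pending Pod, the container waiting reason often points at the missing requirement. A minimal check, assuming jq is available and the names are placeholders:

    kubectl get pod -n <pod_namespace> <pod_name> -o=json | jq -r \
    '.status.phase, (.status.containerStatuses[]? | "\(.name): \(.state.waiting.reason // "n/a")")'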

Investigation

Note

Verify if there are more alerts firing in the Container Cloud cluster to obtain more information on the cluster state and simplify issue investigation and mitigation.

Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on.

  1. List the unhealthy Pods of the affected application. Use the label selector, if possible.

    kubectl get pods -n <pod_namespace> -l '<pod_app_label>=<pod_app_name>' \
    -o=json | jq -r \
    '.items[] | select(.status.phase != "Running") | .metadata.name'
    
  2. For one of the unhealthy Pods, verify the Kubernetes events and, for Failed Pods, the termination reason and exit code:

    kubectl describe pods -n <pod_namespace> <pod_name>
    

    Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI.

  3. For Failed Pods, collect logs and inspect them for errors and stack traces:

    kubectl logs -n <pod_namespace> <pod_name>
    
  4. In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage.

Mitigation

  • For Pending, investigate and fix the root cause of the missing Pod requirements. For example, a dependent application failure, an unavailable Docker registry, an unresponsive storage provider, and so on.

  • For Failed, see the KubePodsCrashLooping Mitigation section.

  • For Unknown, first verify and resolve the network-related alerts firing in the Kubernetes cluster.

KubePodsRegularLongTermRestarts

Related inhibiting alert: KubePodsCrashLooping.

Root cause

It is a long-term version of the KubePodsCrashLooping alert, aiming to catch Pod container restarts in wider time windows. The alert is raised when a Pod container restarts at least once a day within a 2-day time frame. It may indicate a pattern in the application lifecycle that needs investigation, such as deadlocks, memory leaks, and so on.
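
For example, to see how many times each container of the affected Pod restarted and when it last terminated, a sketch such as the following may help (jq is assumed to be available, names are placeholders):

    kubectl get pod -n <pod_namespace> <pod_name> -o=json | jq -r \
    '.status.containerStatuses[]? | "\(.name) restarts=\(.restartCount) lastFinished=\(.lastState.terminated.finishedAt // "n/a")"'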

Investigation

While investigating, the affected Pod will likely be in the Running state.

  1. List the Pods of the application whose containers were restarted at least twice. Use the label selector, if possible.

    kubectl get pods -n <pod_namespace> -l '<pod_app_label>=<pod_app_name>' \
    -o=json | jq -r \
    '.items[] | select(any(.status.containerStatuses[]?; .restartCount >= 2)) | .metadata.name'
    
  2. Collect logs for one of the affected Pods and inspect them for errors and stack traces:

    kubectl logs -n <pod_namespace> <pod_name>
    
  3. In the OpenSearch Dashboards web UI, inspect the K8S events dashboard. Filter the Pod using the kubernetes.event.involved_object.name key.

  4. In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage. Filter the affected Pod and find the point in time when the container was restarted. Observations may take several days.

Mitigation

Refer to the KubePodsCrashLooping Mitigation section. Fixing this issue may require more effort than simple application tuning. You may need to upgrade the application, upgrade its dependency libraries, or apply a fix in the application code.

KubeDeploymentGenerationMismatch

Root cause

Deployment generation, or version, occupies 2 fields in the object:

  • .metadata.generation - the desired Deployment generation. It is incremented every time the Deployment specification changes, for example, upon kubectl apply execution. A change to the Pod template triggers a new ReplicaSet rollout.

  • .status.observedGeneration - the generation most recently observed by the Deployment controller. It is updated once the controller processes the new specification.

When the Deployment controller fails to observe a new Deployment version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
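
To verify whether the two fields currently differ, a check like the following can be used (names are placeholders):

    kubectl get deployment -n <deployment_namespace> <deployment_name> -o=json | jq -r \
    '"desired=\(.metadata.generation) observed=\(.status.observedGeneration)"'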

Investigation and mitigation

The alert indicates failure of a Kubernetes built-in Deployment controller and requires debugging on the control plane level. See Troubleshooting for details on collecting cluster state and mitigating known issues.

KubeDeploymentReplicasMismatch

Root cause

The number of available Deployment replicas did not match the desired state set in the .spec.replicas field for the last 30 minutes, meaning that at least one Deployment Pod is down.
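
A quick way to compare the desired and available replica counts (names are placeholders):

    kubectl get deployment -n <deployment_namespace> <deployment_name> -o=json | jq -r \
    '"desired=\(.spec.replicas) available=\(.status.availableReplicas // 0)"'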

Investigation and mitigation

Refer to KubePodsCrashLooping.

KubeDeploymentOutage

Related inhibited alert: KubeDeploymentReplicasMismatch.

Root cause

All Deployment replicas have been unavailable for the last 10 minutes, meaning that the application is likely down.

Investigation

  1. Verify the Deployment status:

    kubectl get deployment -n <deployment_namespace> <deployment_name>
    
  2. Inspect the related Kubernetes events for error messages and probe failures:

    kubectl describe deployment -n <deployment_namespace> <deployment_name>
    

    If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI.

  3. List Pods of the Deployment and verify them one by one. Use label selectors, if possible:

    kubectl get pods -n <deployment_namespace> -l \
    '<deployment_app_label>=<deployment_app_name>'
    

    See KubePodsCrashLooping.

Mitigation

Refer to KubePodsCrashLooping.

KubeStatefulSetReplicasMismatch

Root cause

The number of running StatefulSet replicas did not match the desired state set in the .spec.replicas field for the last 30 minutes, meaning that at least one StatefulSet Pod is down.

Investigation and mitigation

Refer to KubePodsCrashLooping.

KubeStatefulSetGenerationMismatch

Root cause

StatefulSet generation, or version, occupies 2 fields in the object:

  • .metadata.generation - the desired StatefulSet generation. It is incremented every time the StatefulSet specification changes, for example, upon kubectl apply execution. A change to the Pod template triggers a rolling update of the StatefulSet Pods.

  • .status.observedGeneration - the generation most recently observed by the StatefulSet controller. It is updated once the controller processes the new specification.

When the StatefulSet controller fails to observe a new StatefulSet version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
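
The same check as for Deployments applies, using the StatefulSet object (names are placeholders):

    kubectl get sts -n <sts_namespace> <sts_name> -o=json | jq -r \
    '"desired=\(.metadata.generation) observed=\(.status.observedGeneration)"'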

Investigation and mitigation

The alert indicates failure of a Kubernetes built-in StatefulSet controller and requires debugging on the control plane level. See Troubleshooting for details on collecting cluster state and mitigating known issues.

KubeStatefulSetOutage

Related inhibited alerts: KubeStatefulSetReplicasMismatch and KubeStatefulSetUpdateNotRolledOut.

Root cause

StatefulSet workloads are typically distributed across Kubernetes nodes. Therefore, losing more than one replica indicates either a serious application failure or issues on the Kubernetes cluster level. The application likely experiences severe performance degradation and availability issues.

Investigation

  1. Verify the StatefulSet status:

    kubectl get sts -n <sts_namespace> <sts_name>
    
  2. Inspect the related Kubernetes events for error messages and probe failures:

    kubectl describe sts -n <sts_namespace> <sts_name>
    

    If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI.

  3. List the StatefulSet Pods and verify them one by one. Use the label selectors, if possible.

    kubectl get pods -n <sts_namespace> -l '<sts_app_label>=<sts_app_name>'
    

    See KubePodsCrashLooping.

Mitigation

Refer to KubePodsCrashLooping. If after fixing the root cause on the Pod level the affected Pods are still non-Running, contact Mirantis support. StatefulSets must be treated with special caution as they store data and their internal state.

KubeStatefulSetUpdateNotRolledOut

Root cause

The StatefulSet update did not finish in 30 minutes, which is detected as a mismatch between the .spec.replicas and .status.updatedReplicas fields. Such an issue may occur during a rolling update if a newly created Pod fails to pass the readiness test and blocks the update.
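
To see how far the update progressed, compare the desired, updated, and ready replica counts (names are placeholders):

    kubectl get sts -n <sts_namespace> <sts_name> -o=json | jq -r \
    '"desired=\(.spec.replicas) updated=\(.status.updatedReplicas // 0) ready=\(.status.readyReplicas // 0)"'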

Investigation

  1. Verify the rollout status:

    kubectl rollout status -n <sts_namespace> sts/<sts_name>
    

    The output includes the number of updated Pods. In Container Cloud, StatefulSets use the RollingUpdate strategy for upgrades and the Pod management policy does not affect updates. Therefore, investigation requires verifying the failing Pods only.

  2. List the non-Running Pods of the StatefulSet and inspect them one by one for error messages and probe failures. Use the label selectors, if possible.

    kubectl get pods -n <sts_namespace> -l '<sts_app_label>=<sts_app_name>' \
    -o=json | jq -r \
    '.items[] | select(.status.phase!="Running") | .metadata.name'
    

    See KubePodsCrashLooping. Pay special attention to the information about the application cluster issues, as clusters in Container Cloud are deployed as StatefulSets.

    If none of these alerts apply and the new Pod is stuck failing to pass postStartHook (the Pod is in the PodInitializing state) or the readiness probe (the Pod is in the Running state, but not fully ready, for example, 0/1), it may be caused by the Pod inability to join the application cluster. Investigating such an issue requires understanding how the application cluster initializes and how nodes join the cluster; the inspection sketch after this step may help. The PodInitializing state may be especially problematic as the kubectl logs command does not collect logs from such a Pod.

    Warning

    Perform the following step with caution and remember to perform a rollback afterward.

    In some StatefulSets, disabling postStartHook lets the Pod reach the Running state and allows for log collection.
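
    For example, to inspect whether the failing Pod defines a postStart hook and what its readiness probe is, a sketch like the following can be used (the Pod name is a placeholder):

    kubectl get pod -n <sts_namespace> <pod_name> -o=json | jq \
    '.spec.containers[] | {name: .name, postStart: .lifecycle.postStart, readinessProbe: .readinessProbe}'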

Mitigation

Refer to KubePodsCrashLooping.

If after fixing the root cause on the Pod level the affected Pods are still non-Running, contact Mirantis support. Treat StatefulSets with special caution as they store data and their internal state. Improper handling may result in a broken application cluster state and data loss.

KubeDaemonSetRolloutStuck

Related inhibiting alert: KubeDaemonSetOutage.

Root cause

For the last 30 minutes, the DaemonSet has had at least one Pod (not necessarily the same one) that is not ready after being correctly scheduled. It may be caused by missing Pod requirements on the node or by unexpected Pod termination.
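
To confirm, compare the desired and ready Pod counts reported in the DaemonSet status (names are placeholders):

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o=json | jq -r \
    '"desired=\(.status.desiredNumberScheduled) ready=\(.status.numberReady)"'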

Investigation

  1. List the non-Running Pods of the DaemonSet:

    kubectl get pods -n <daemonset_namespace> -l \
    '<daemonset_app_label>=<daemonset_app_name>' -o json \
    | jq '.items[] | select(.status.phase!="Running") | .metadata.name'
    
  2. For the listed Pods, apply the steps described in the KubePodsCrashLooping Investigation section.

Mitigation

See KubePodsCrashLooping.

KubeDaemonSetNotScheduled

Can relate to: KubeCPUOvercommitPods, KubeMemOvercommitPods.

Root cause

At least one Pod of the DaemonSet was not scheduled to a target node. This may happen if resource requests for the Pod cannot be satisfied by the node or if the node lacks other resources that the Pod requires, such as a PV of a specific storage class.
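
To confirm that some Pods were not scheduled, compare the desired and currently scheduled counts (names are placeholders):

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o=json | jq -r \
    '"desired=\(.status.desiredNumberScheduled) scheduled=\(.status.currentNumberScheduled)"'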

Investigation

  1. Identify the number of available and desired Pods of the DaemonSet:

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name>
    
  2. Identify the nodes that already have the DaemonSet Pods running:

    kubectl get pods -n <daemonset_namespace> -l \
    '<daemonset_app_label>=<daemonset_app_name>' -o json \
    | jq -r '.items[].spec.nodeName'
    
  3. Compare the result with all nodes:

    kubectl get nodes
    
  4. Identify the nodes that do not have the DaemonSet Pods running:

    kubectl describe nodes <node_name>
    

    See the Allocated resources and Events sections to identify the nodes that do not have enough resources.

Mitigation

See KubeCPUOvercommitPods and KubeMemOvercommitPods.

KubeDaemonSetMisScheduled

Root cause

At least one node where the DaemonSet Pods were deployed got a NoSchedule taint added afterward. Taints are respected during the scheduling stage only, so the already running Pods are now considered misscheduled on such nodes.
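
The DaemonSet status directly reports the number of such Pods (names are placeholders):

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o=json | jq -r \
    '.status.numberMisscheduled'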

Investigation

  1. List the taints of all Kubernetes cluster nodes:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    
  2. Verify the DaemonSet tolerations and currently occupied nodes:

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o \
    custom-columns=NAME:.metadata.name,TOLERATIONS:.spec.tolerations,NODE:.spec.nodeName
    
  3. Compare the output of the two commands above and define the nodes that should not have DaemonSet Pods deployed.

Mitigation

  • If the DaemonSet Pod should run on the affected nodes, add toleration for the corresponding taint in the DaemonSet.

  • If the DaemonSet Pod should not run on the affected nodes, delete the DaemonSet Pods from all nodes with a non-tolerated taint.
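
    A possible way to do this, assuming placeholder names: list the Pods with the nodes they run on, then delete the Pods located on nodes with a non-tolerated taint. The DaemonSet controller should not recreate them on such nodes.

    kubectl get pods -n <daemonset_namespace> -l '<daemonset_app_label>=<daemonset_app_name>' -o wide
    kubectl delete pod -n <daemonset_namespace> <pod_name>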

KubeDaemonSetOutage

Related inhibiting alert: KubeDaemonSetRolloutStuck.

Root cause

Although the DaemonSet was not scaled down to zero, there are zero healthy Pods. As each DaemonSet Pod is deployed to a separate Kubernetes node, such a situation is rare and typically caused by a broken configuration (ConfigMaps or Secrets) or wrongly tuned resource limits.
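
As a starting point, you can list the ConfigMaps and Secrets mounted by the DaemonSet and verify that they exist and contain the expected data (a minimal sketch, names are placeholders; environment-variable references are not covered):

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o=json | jq -r \
    '.spec.template.spec.volumes[]? | select(.configMap or .secret) | .configMap.name // .secret.secretName'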

Investigation

  1. Verify the DaemonSet status:

    kubectl get daemonset -n <daemonset_namespace> <daemonset_name>
    
  2. Inspect the related Kubernetes events for error messages and probe failures:

    kubectl describe daemonset -n <daemonset_namespace> <daemonset_name>
    

    If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI.

  3. List the DaemonSet Pods and verify them one by one. Use the label selectors, if possible:

    kubectl get pods -n <daemonset_namespace> -l '<daemonset_app_label>=<daemonset_app_name>'
    

Mitigation

See KubePodsCrashLooping.

KubeCronJobRunning

Related alert: ClockSkewDetected.

Root cause

A CronJob Pod failed to start within 15 minutes of the configured schedule due to one of the following possible root causes:

  1. The previously scheduled Pod is still running and the CronJob .spec.concurrencyPolicy was set to Forbid.

  2. The scheduled Job could not start in the CronJob .spec.startingDeadlineSeconds, if set.
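
To see the relevant CronJob settings and the last schedule time at a glance (names are placeholders):

    kubectl get cronjob -n <cronjob_namespace> <cronjob_name> -o=json | jq \
    '{schedule: .spec.schedule, concurrencyPolicy: .spec.concurrencyPolicy, startingDeadlineSeconds: .spec.startingDeadlineSeconds, lastScheduleTime: .status.lastScheduleTime}'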

Investigation

  1. Inspect the running CronJob Pods. Drop the label selector if none is available.

    kubectl get pods -n <cronjob_namespace> -l '<cronjob_app_label>=<cronjob_app_name>' \
    -o=json | jq -r \
    '.items[] | select(.status.phase=="Running") | .metadata.name'
    
  2. If the Pod uptime is unusually long, the Pod can overlap with the upcoming Jobs. Verify the concurrencyPolicy setting:

    kubectl get cronjob -n <cronjob_namespace> <cronjob_name> -o=json | \
    jq -r '.spec.concurrencyPolicy == "Forbid"'
    

    If the output is true, Kubernetes will not allow new Pods to run until the current one terminates. In this case, investigate and fix the issue on the application level.

  3. Collect logs and inspect the Pod resources usage:

    kubectl logs -n <cronjob_namespace> <cronjob_pod_name>
    
  4. If all CronJob Pods terminate normally, inspect Kubernetes events for the CronJob:

    kubectl describe cronjob -n <cronjob_namespace> <cronjob_name>
    

    In case of events similar to Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.:

    1. Verify if the ClockSkewDetected alert is firing for the affected cluster.

    2. Verify the current starting deadline value:

      kubectl get cronjob -n <cronjob_namespace> <cronjob_name> \
      -o=json | jq -r '.spec.startingDeadlineSeconds'
      

Mitigation

  • For root cause 1, fix the issue on the application level.

  • For root cause 2:

    1. If the ClockSkewDetected alert is firing for the affected cluster, resolve it first.

    2. If the CronJob issue is still present, depending on your application, remove or increase the .spec.startingDeadlineSeconds value.

KubeJobFailed

Related inhibited alert: KubePodsNotReady.

Root cause

At least one container of a Pod started by the Job exited with a non-zero status or was terminated by the Kubernetes or Linux system.
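
To find the failed Jobs and their Pods in the affected namespace, a check such as the following can help (the namespace is a placeholder):

    kubectl get jobs -n <job_namespace> -o=json | jq -r \
    '.items[] | select((.status.failed // 0) > 0) | .metadata.name'
    kubectl get pods -n <job_namespace> --field-selector=status.phase=Failed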

Investigation

See KubePodsCrashLooping.

Mitigation

  1. Investigate and fix the root cause of the missing Pod requirements, such as a failing dependent application, Docker registry unavailability, an unresponsive storage provider, and so on.

  2. Use the Mitigation section in KubePodsCrashLooping.

  3. Verify and resolve network-related alerts firing in the Kubernetes cluster.