Troubleshoot Kubernetes applications alerts¶

This section describes the investigation and troubleshooting steps for the Kubernetes applications alerts.

KubePodsCrashLooping
KubePodsNotReady
KubePodsRegularLongTermRestarts
KubeDeploymentGenerationMismatch
KubeDeploymentReplicasMismatch
KubeDeploymentOutage
KubeStatefulSetReplicasMismatch
KubeStatefulSetGenerationMismatch
KubeStatefulSetOutage
KubeStatefulSetUpdateNotRolledOut
KubeDaemonSetRolloutStuck
KubeDaemonSetNotScheduled
KubeDaemonSetMisScheduled
KubeDaemonSetOutage
KubeCronJobRunning
KubeJobFailed

KubePodsCrashLooping¶

Related inhibited alert: KubePodsRegularLongTermRestarts.

Root cause	Termination of containers in Pods having `.spec.restartPolicy` set to `Always` causes Kubernetes to bring them back. If the container exits again, kubelet exponentially increases the back-off delay between next restarts until it reaches 5 minutes. The Pods being stuck in restarts loop get the `CrashLoopBackOff` status. Because of the underlying metric inertia, StackLight measures restarts in an extended 20-minute time window.
Investigation	Note Verify if there are more alerts firing in the MOSK cluster to obtain more information on the cluster state and simplify issue investigation and mitigation. Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on. During investigation, the affected Pod will likely be in the `CrashLoopBackOff` or `Error` state. List the unhealthy Pods of a particular application. Use the label selector, if possible. kubectl get pods -n <pod_namespace> -l '<pod_app_label>=<pod_app_name>' \ -o=json \| jq -r '.items[] \| select(.status.phase != "Running") \| \ .metadata.name, .status.phase' Collect logs from one of the unhealthy Pods and inspect them for errors and stack traces: kubectl logs -n <pod_namespace> <pod_name> Inspect Kubernetes events or the termination reason and exit code of the Pod: kubectl describe pods -n <pod_namespace> <pod_name> Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI. In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage. Important Performing the following step requires understanding of Kubernetes workloads. In some scenarios, observing Pods failing in real time may provide more insight on the issue. To investigate the application this way, restart (never with the `--force` flag) one of the failing Pods and inspect the following in the Kubernetes Pods Grafana dashboard, events and logs: Define whether the issue reproduces. Verify when does the issue reproduce in the Pod uptime: during the initialization or after some time. Verify that the application requirements for Kubernetes resources and external dependencies are satisfied. Define whether there is an issue with passing the readiness or liveness tests. Define how the Pod container terminates and whether it is `OOMKilled`. Note While investigating, monitor the application health and verify the resource limits. Most issues can be solved by fixing the dependent application or tuning, such as providing additional flags, changing resource limits, and so on.
Mitigation	Fixes typically fall into one of the following categories: Fix the dependent service. For example, fixing `opensearch-master` makes `fluentd-logs` Pods start successfully. Fix the configuration if it causes container failure. Tune the application by providing flags changing its behavior. Tune the CPU or MEM limits if the system terminates a container upon hitting the memory limits (`OOMKilled`) or stops responding because of CPU throttling. Fix code in case of application bugs.

Root cause

Termination of containers in Pods having .spec.restartPolicy set to Always causes Kubernetes to bring them back. If the container exits again, kubelet exponentially increases the back-off delay between next restarts until it reaches 5 minutes. The Pods being stuck in restarts loop get the CrashLoopBackOff status. Because of the underlying metric inertia, StackLight measures restarts in an extended 20-minute time window.

Investigation

Note

Verify if there are more alerts firing in the MOSK cluster to obtain more information on the cluster state and simplify issue investigation and mitigation.

Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on.

During investigation, the affected Pod will likely be in the CrashLoopBackOff or Error state.

List the unhealthy Pods of a particular application. Use the label selector, if possible.

kubectl get pods -n <pod_namespace> -l '<pod_app_label>=<pod_app_name>' \
-o=json | jq -r '.items[] | select(.status.phase != "Running") | \
.metadata.name, .status.phase'

Collect logs from one of the unhealthy Pods and inspect them for errors and stack traces:
```
kubectl logs -n <pod_namespace> <pod_name>
```
Inspect Kubernetes events or the termination reason and exit code of the Pod:
```
kubectl describe pods -n <pod_namespace> <pod_name>
```
Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI.
In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage.

Important

Performing the following step requires understanding of Kubernetes workloads.
In some scenarios, observing Pods failing in real time may provide more insight on the issue. To investigate the application this way, restart (never with the --force flag) one of the failing Pods and inspect the following in the Kubernetes Pods Grafana dashboard, events and logs:
- Define whether the issue reproduces.
- Verify when does the issue reproduce in the Pod uptime: during the initialization or after some time.
- Verify that the application requirements for Kubernetes resources and external dependencies are satisfied.
- Define whether there is an issue with passing the readiness or liveness tests.
- Define how the Pod container terminates and whether it is OOMKilled.
Note

While investigating, monitor the application health and verify the resource limits. Most issues can be solved by fixing the dependent application or tuning, such as providing additional flags, changing resource limits, and so on.

Mitigation

Fixes typically fall into one of the following categories:

Fix the dependent service. For example, fixing opensearch-master makes fluentd-logs Pods start successfully.
Fix the configuration if it causes container failure.
Tune the application by providing flags changing its behavior.
Tune the CPU or MEM limits if the system terminates a container upon hitting the memory limits (OOMKilled) or stops responding because of CPU throttling.
Fix code in case of application bugs.

KubePodsNotReady¶

Removed in 17.0.0 and 16.0.0

Root cause	The Pod could not start successfully for the last 15 minutes, meaning that its status phase is one of the following: `Pending` - at least one Pod container was not created. The Pod waits for the Kubernetes cluster to satisfy its requirement. For example, in case of failure to pull the Docker image or create a persistent volume. `Failed` - the Pod terminated in the `Error` state and was not restarted. At least one container exited with a non-zero status code or was terminated by the system, for example, `OOMKilled`. `Unknown` - kubelet communication issues.
Investigation	Note Verify if there are more alerts firing in the MOSK cluster to obtain more information on the cluster state and simplify issue investigation and mitigation. Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on. List the unhealthy Pods of the affected application. Use the label selector, if possible. kubectl get pods -n <pod_namespace> -l \ '<pod_app_label>=<pod_app_name>' -o=json \| jq -r '.items[] \| \ select(.status.phase != "Running") \| .metadata.name' For one of the unhealthy Pods, verify Kubernetes events, termination reason, and exit code (for `Failed` only) of the Pod: kubectl describe pods -n <pod_namespace> <pod_name> Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI. For `Failed` Pods, collect logs and inspect them for errors and stack traces: kubectl logs -n <pod_namespace> <pod_name> In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage.
Mitigation	For `Pending`, investigate and fix the root cause of the missing Pod requirements. For example, dependent application failure, unavailable Docker registry, unresponsive storage provided, and so on. For `Failed`, see the KubePodsCrashLooping Mitigation section. For `Unknown`, first verify and resolve the network-related alerts firing in the Kubernetes cluster.

Root cause

The Pod could not start successfully for the last 15 minutes, meaning that its status phase is one of the following:

Pending - at least one Pod container was not created. The Pod waits for the Kubernetes cluster to satisfy its requirement. For example, in case of failure to pull the Docker image or create a persistent volume.
Failed - the Pod terminated in the Error state and was not restarted. At least one container exited with a non-zero status code or was terminated by the system, for example, OOMKilled.
Unknown - kubelet communication issues.

Investigation

Note

Verify if there are more alerts firing in the MOSK cluster to obtain more information on the cluster state and simplify issue investigation and mitigation.

Also examine the relation of the affected application with other applications (dependencies) and Kubernetes resources it relies on.

List the unhealthy Pods of the affected application. Use the label selector, if possible.

kubectl get pods -n <pod_namespace> -l \
'<pod_app_label>=<pod_app_name>' -o=json | jq -r '.items[] | \
select(.status.phase != "Running") | .metadata.name'

For one of the unhealthy Pods, verify Kubernetes events, termination reason, and exit code (for Failed only) of the Pod:
```
kubectl describe pods -n <pod_namespace> <pod_name>
```
Alternatively, inspect K8S Events in the OpenSearch Dashboards web UI.
For Failed Pods, collect logs and inspect them for errors and stack traces:
```
kubectl logs -n <pod_namespace> <pod_name>
```
In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage.

Mitigation

For Pending, investigate and fix the root cause of the missing Pod requirements. For example, dependent application failure, unavailable Docker registry, unresponsive storage provided, and so on.
For Failed, see the KubePodsCrashLooping Mitigation section.
For Unknown, first verify and resolve the network-related alerts firing in the Kubernetes cluster.

See also

Kubernetes documentation: Pod phase

KubePodsRegularLongTermRestarts¶

Related inhibiting alert: KubePodsCrashLooping.

Root cause	It is a long-term version of the `KubePodsCrashLooping` alert, aiming to catch Pod container restarts in wider time windows. The alert raises when the Pod container restarts once a day in a 2-days time frame. It may indicate that a pattern in the application lifecycle needs investigation, such as deadlocks, memory leaks, and so on.
Investigation	While investigating, the affected Pod will likely be in the `Running` state. List the Pods of the application, which containers were restarted at least twice. Use the label selector, if possible. kubectl get pods -n <pod_namespace> -l \ '<pod_app_label>=<pod_app_name>' -o=json \| jq -r '.items[] \| \ select(.status.phase != "Running") \| .metadata.name, .status.phase' Collect logs for one of the affected Pods and inspect them for errors and stack traces: kubectl logs -n <pod_namespace> <pod_name> In the OpenSearch Dashboards web UI, inspect the K8S events dashboard. Filter the Pod using the `kubernetes.event.involved_object.name` key. In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage. Filter the affected Pod and find the point in time when the container was restarted. Observations may take several days.
Mitigation	Refer to the KubePodsCrashLooping Mitigation section. Fixing this issue may require more effort than simple application tuning. You may need to upgrade the application, upgrade its dependency libraries, or apply a fix in the application code.

Root cause

It is a long-term version of the KubePodsCrashLooping alert, aiming to catch Pod container restarts in wider time windows. The alert raises when the Pod container restarts once a day in a 2-days time frame. It may indicate that a pattern in the application lifecycle needs investigation, such as deadlocks, memory leaks, and so on.

Investigation

While investigating, the affected Pod will likely be in the Running state.

List the Pods of the application, which containers were restarted at least twice. Use the label selector, if possible.

kubectl get pods -n <pod_namespace> -l \
'<pod_app_label>=<pod_app_name>' -o=json | jq -r '.items[] | \
select(.status.phase != "Running") | .metadata.name, .status.phase'

Collect logs for one of the affected Pods and inspect them for errors and stack traces:
```
kubectl logs -n <pod_namespace> <pod_name>
```
In the OpenSearch Dashboards web UI, inspect the K8S events dashboard. Filter the Pod using the kubernetes.event.involved_object.name key.
In the Kubernetes Pods Grafana dashboard, monitor the Pod resources usage. Filter the affected Pod and find the point in time when the container was restarted. Observations may take several days.

Mitigation

Refer to the KubePodsCrashLooping Mitigation section. Fixing this issue may require more effort than simple application tuning. You may need to upgrade the application, upgrade its dependency libraries, or apply a fix in the application code.

See also

Kubernetes documentation: Pod Lifecycle

KubeDeploymentGenerationMismatch¶

Root cause	Deployment generation, or `version`, occupies 2 fields in the object: `.metadata.generation` (updated upon kubectl apply execution) - the desired Deployment generation. `.status.observedGeneration` (triggers a new `ReplicaSet` rollout) - observed by Deployment controller. When the Deployment controller fails to observe a new Deployment version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
Investigation and mitigation	The alert indicates failure of a Kubernetes built-in Deployment controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.

Root cause

Deployment generation, or version, occupies 2 fields in the object:

.metadata.generation (updated upon kubectl apply execution) - the desired Deployment generation.
.status.observedGeneration (triggers a new ReplicaSet rollout) - observed by Deployment controller.

When the Deployment controller fails to observe a new Deployment version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.

Investigation and mitigation

The alert indicates failure of a Kubernetes built-in Deployment controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.

See also

KubeDeploymentReplicasMismatch¶

Root cause	The number of available Deployment replicas did not match the desired state set in the `.spec.replicas` field for the last 30 minutes, meaning that at least one Deployment Pod is down.
Investigation and mitigation	Refer to KubePodsCrashLooping.

KubeDeploymentOutage¶

Related inhibited alert: KubeDeploymentReplicasMismatch.

Root cause	All Deployment replicas are unavailable for the last 10 minutes, meaning that the application is likely down.
Investigation	Verify the Deployment status: kubectl get deployment -n <deployment_namespace> <deployment_name> Inspect the related Kubernetes events for error messages and probe failures: kubectl describe deployment -n <deployment_namespace> <deployment_name> If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI. List Pods of the Deployment and verify them one by one. Use label selectors, if possible: kubectl get pods -n <deployment_namespace> -l \ '<deployment_app_label>=<deployment_app_name>' See KubePodsCrashLooping.
Mitigation	Refer to KubePodsCrashLooping.

KubeStatefulSetReplicasMismatch¶

Root cause	The number of running StatefulSet replicas did not match the desired state set in the `.spec.replicas` field for the last 30 minutes, meaning that at least one StatefulSet Pod is down.
Investigation and mitigation	Refer to KubePodsCrashLooping.

KubeStatefulSetGenerationMismatch¶

Root cause	StatefulSet generation, or `version`, occupies 2 fields in the object: `.metadata.generation` (updated upon kubectl apply execution) - the desired StatefulSet generation. `.status.observedGeneration` (triggers a new `ReplicaSet` rollout) - observed by StatefulSet controller. When the StatefulSet controller fails to observe a new StatefulSet version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.
Investigation and mitigation	The alert indicates failure of a Kubernetes built-in StatefulSet controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.

Root cause

StatefulSet generation, or version, occupies 2 fields in the object:

.metadata.generation (updated upon kubectl apply execution) - the desired StatefulSet generation.
.status.observedGeneration (triggers a new ReplicaSet rollout) - observed by StatefulSet controller.

When the StatefulSet controller fails to observe a new StatefulSet version, these 2 fields differ. The mismatch lasting for more than 15 minutes triggers the alert.

Investigation and mitigation

The alert indicates failure of a Kubernetes built-in StatefulSet controller and requires debugging on the control plane level. See Troubleshooting Guide for details on collecting cluster state and mitigating known issues.

KubeStatefulSetOutage¶

Related inhibited alerts: KubeStatefulSetReplicasMismatch and KubeStatefulSetUpdateNotRolledOut.

Root cause	StatefulSet workloads are typically distributed across Kubernetes nodes. Therefore, losing more than one replica indicates either a serious application failure or issues on the Kubernetes cluster level. The application likely experiences severe performance degradation and availability issues.
Investigation	Verify the StatefulSet status: kubectl get sts -n <sts_namespace> <sts_name> Inspect the related Kubernetes events for error messages and probe failures kubectl describe sts -n <sts_namespace> <sts_name> If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI. List the StatefulSet Pods and verify them one by one. Use the label selectors, if possible. kubectl get pods -n <sts_namespace> -l '<sts_app_label>=<sts_app_name>' See KubePodsCrashLooping.
Mitigation	Refer to KubePodsCrashLooping. If after fixing the root cause on the Pod level the affected Pods are still non-`Running`, contact Mirantis support. StatefulSets must be treated with special caution as they store data and their internal state.

KubeStatefulSetUpdateNotRolledOut¶

Root cause	The StatefulSet update did not finish in 30 minutes, which was detected in the mismatch of the `.spec.replicas` and `.status.updatedReplicas` fields. Such issue may occur during a rolling update if a newly created Pod fails to pass the readiness test and blocks the update.
Investigation	Verify the rollout status: kubectl rollout status -n <sts_namespace> sts/<sts_name> The output includes the number of updated Pods. In Container Cloud, StatefulSets use the `RollingUpdate` strategy for upgrades and the Pod management policy does not affect updates. Therefore, investigation requires verifying the failing Pods only. List the non-`Running` Pods of the StatefulSet and inspect them one by one for error messages and probe failures. Use the label selectors, if possible. kubectl get pods -n <sts_namespace> -l \ '<sts_app_label>=<sts_app_name>' -o=json \| jq -r '.items[] \| \ select(.status.phase!="Running") \| .metadata.name' See KubePodsCrashLooping. Pay special attention to the information about the application cluster issues, as clusters in Container Cloud are deployed as StatefulSets. If none of these alerts apply and the new Pod is stuck failing to pass `postStartHook` (Pod is in the `PodInitializing` state) or the readiness probe (Pod in the `Running` state, but not fully ready, for example, `0/1`) it may be caused by Pod inability to join the application cluster. Investigating such issue requires understanding how the application cluster initializes and how nodes join the cluster. The `PodInitializing` state may be especially problematic as the kubectl logs command does not collect logs from such Pod. Warning Perform the following step with caution and remember to perform a rollback afterward. In some StatefulSets, disabling `postStartHook` unlocks the Pod to the `Running` state and allows for logs collection.
Mitigation	Refer to KubePodsCrashLooping. If after fixing the root cause on the Pod level the affected Pods are still non-`Running`, contact Mirantis support. Treat StatefulSets with special caution as they store data and their internal state. Improper handling may result in a broken application cluster state and data loss.

Root cause

The StatefulSet update did not finish in 30 minutes, which was detected in the mismatch of the .spec.replicas and .status.updatedReplicas fields. Such issue may occur during a rolling update if a newly created Pod fails to pass the readiness test and blocks the update.

Investigation

Verify the rollout status:
```
kubectl rollout status -n <sts_namespace> sts/<sts_name>
```
The output includes the number of updated Pods. In Container Cloud, StatefulSets use the RollingUpdate strategy for upgrades and the Pod management policy does not affect updates. Therefore, investigation requires verifying the failing Pods only.
List the non-Running Pods of the StatefulSet and inspect them one by one for error messages and probe failures. Use the label selectors, if possible.
```
kubectl get pods -n <sts_namespace> -l \
'<sts_app_label>=<sts_app_name>' -o=json | jq -r '.items[] | \
select(.status.phase!="Running") | .metadata.name'
```
See KubePodsCrashLooping. Pay special attention to the information about the application cluster issues, as clusters in Container Cloud are deployed as StatefulSets.

If none of these alerts apply and the new Pod is stuck failing to pass postStartHook (Pod is in the PodInitializing state) or the readiness probe (Pod in the Running state, but not fully ready, for example, 0/1) it may be caused by Pod inability to join the application cluster. Investigating such issue requires understanding how the application cluster initializes and how nodes join the cluster. The PodInitializing state may be especially problematic as the kubectl logs command does not collect logs from such Pod.

Warning

Perform the following step with caution and remember to perform a rollback afterward.

In some StatefulSets, disabling postStartHook unlocks the Pod to the Running state and allows for logs collection.

Mitigation

Refer to KubePodsCrashLooping.

If after fixing the root cause on the Pod level the affected Pods are still non-Running, contact Mirantis support. Treat StatefulSets with special caution as they store data and their internal state. Improper handling may result in a broken application cluster state and data loss.

See also

KubeDaemonSetRolloutStuck¶

Related inhibiting alert: KubeDaemonSetOutage.

Root cause	For the last 30 minutes, DaemonSet has at least one Pod (not necessarily the same one), which is not ready after being correctly scheduled. It may be caused by missing Pod requirements on the node or unexpected Pod termination.
Investigation	List the non-Running Pods of the DaemonSet: kubectl get pods -n <daemonset_namespace> -l \ '<daemonset_app_label>=<daemonset_app_name>' -o json \ \| jq '.items[] \| select(.status.phase!="Running") \| .metadata.name' For the listed Pods, apply the steps described in the KubePodsCrashLooping Investigation section.
Mitigation	See KubePodsCrashLooping.

See also

Kubernetes documentation: DaemonSet

KubeDaemonSetNotScheduled¶

Can relate to: KubeCPUOvercommitPods, KubeMemOvercommitPods.

Root cause	At least one Pod of the DaemonSet was not scheduled to a target node. This may happen if resource requests for the Pod cannot be satisfied by the node or if the node lacks other resources that the Pod requires, such as PV of a specific storage class.
Investigation	Identify the number of available and desired Pods of the DaemonSet: kubectl get daemonset -n <daemonset_namespace> <deamonset_name> Identify the nodes that already have the DaemonSet Pods running: kubectl get pods -n <daemonset_namespace> -l \ '<daemonset_app_label>=<daemonset_app_name>' -o json \ \| jq -r '.items[].spec.nodeName' Compare the result with all nodes: kubectl get nodes Identify the nodes that do not have the DaemonSet Pods running: kubectl describe nodes <node_name> See the `Allocated resources` and `Events` sections to identify the node that has not enough resources.
Mitigation	See KubeCPUOvercommitPods and KubeMemOvercommitPods.

KubeDaemonSetMisScheduled¶

Removed in MCC 2.27.0 (17.2.0 and 16.2.0)

Root cause	At least one node where the DaemonSet Pods were deployed got a `NoSchedule` taint added afterward. Taints are respected during the scheduling stage only, and the Pod is currently considered unschedulable to such nodes.
Investigation	List the taints of all Kubernetes cluster nodes: kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints Verify the DaemonSet tolerations and currently occupied nodes: kubectl get daemonset -n <daemonset_namespace> <daemonset_name> -o \ custom-columns=NAME:.metadata.name,TOLERATIONS:.spec.tolerations,NODE:.spec.nodeName Compare the output of the two commands above and define the nodes that should not have DaemonSet Pods deployed.
Mitigation	If the DaemonSet Pod should run on the affected nodes, add toleration for the corresponding taint in the DaemonSet. If the DaemonSet Pod should not run on the affected nodes, delete the DaemonSet Pods from all nodes with a non-tolerated taint.

KubeDaemonSetOutage¶

Related inhibiting alert: KubeDaemonSetRolloutStuck.

Root cause	Although the DaemonSet was not scaled down to zero, there are zero healthy Pods. As each DaemonSet Pod is deployed to a separate Kubernetes node, such situation is rare and typically caused by a broken configuration (ConfigMaps or Secrets) or wrongly tuned resource limits.
Investigation	Verify the DaemonSet status: kubectl get daemonset -n <daemonset_namespace> <daemonset_name> Inspect the related Kubernetes events for error messages and probe failures: kubectl describe daemonset -n <damonset_namespace> <damonset_name> If events are unavailable, inspect K8S Events in the OpenSearch Dashboards web UI. List the Deployment Pods and verify them one by one. Use the label selectors, if possible: kubectl get pods -n <damonset_namespace> -l '<damonset_app_label>=<damonset_app_name>'
Mitigation	See KubePodsCrashLooping.

See also

Kubernetes documentation: DaemonSet

KubeCronJobRunning¶

Related alert: ClockSkewDetected.

Root cause	A CronJob Pod fails to start in 15 minutes from the configured schedule due to the following possible root causes: The previously scheduled Pod is still running and the CronJob `.spec.concurrencyPolicy` was set to `Forbid`. The scheduled Job could not start in the CronJob `.spec.startingDeadlineSeconds`, if set.
Investigation	Inspect the running CronJob Pods. Drop the label selector if none is available. kubectl get pods -n <cronjob_namespace> -l \ '<cronjob_app_label>=<cronjob_app_name>' -o=json \| jq -r '.items[] \ \| select(.status.phase=="Running") \| .metadata.name' If Pod uptime is unusually long, it can overlap with the upcoming Jobs. Verify the `concurrencyPolicy` setting: kubectl get cronjob -n <cronjob_namespace> <conrjob_name> -o=json \| \ jq -r '.spec.concurrencyPolicy == "Forbid"' If the output is `true`, Kubernetes will not allow new Pods to run until the current one terminates. In this case, investigate and fix the issue on the application level. Collect logs and inspect the Pod resources usage: kubectl logs -n <cronjob_namespace> <cronjob_pod_name> If all CronJob Pods terminate normally, inspect Kubernetes events for the CronJob: kubectl describe cronjob -n <cronjob_namespace> <cronjob_name> In case of events similar to Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.: Verify if the `ClockSkewDetected` alert is firing for the affected cluster. Verify the current starting deadline value: kubectl get cronjob -n <cronjob_namespace> <conrjob_name> \ -o=json \| jq -r '.spec.startingDeadlineSeconds'
Mitigation	For root cause 1, fix the issue on the application level. For root cause 2: If the `ClockSkewDetected` alert is firing for the affected cluster, resolve it first. If the CronJob issue is still present, depending on your application, remove or increase the `.spec.startingDeadlineSeconds` value.

Root cause

A CronJob Pod fails to start in 15 minutes from the configured schedule due to the following possible root causes:

The previously scheduled Pod is still running and the CronJob .spec.concurrencyPolicy was set to Forbid.
The scheduled Job could not start in the CronJob .spec.startingDeadlineSeconds, if set.

Investigation

Inspect the running CronJob Pods. Drop the label selector if none is available.

kubectl get pods -n <cronjob_namespace> -l \
'<cronjob_app_label>=<cronjob_app_name>' -o=json | jq -r '.items[] \
| select(.status.phase=="Running") | .metadata.name'

If Pod uptime is unusually long, it can overlap with the upcoming Jobs. Verify the concurrencyPolicy setting:
```
kubectl get cronjob -n <cronjob_namespace> <conrjob_name> -o=json | \
jq -r '.spec.concurrencyPolicy == "Forbid"'
```
If the output is true, Kubernetes will not allow new Pods to run until the current one terminates. In this case, investigate and fix the issue on the application level.

Collect logs and inspect the Pod resources usage:

kubectl logs -n <cronjob_namespace> <cronjob_pod_name>

If all CronJob Pods terminate normally, inspect Kubernetes events for the CronJob:
```
kubectl describe cronjob -n <cronjob_namespace> <cronjob_name>
```
In case of events similar to Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.:
1. Verify if the ClockSkewDetected alert is firing for the affected cluster.
2. Verify the current starting deadline value:
```
kubectl get cronjob -n <cronjob_namespace> <conrjob_name> \
-o=json | jq -r '.spec.startingDeadlineSeconds'
```

Mitigation

For root cause 1, fix the issue on the application level.
For root cause 2:
1. If the ClockSkewDetected alert is firing for the affected cluster, resolve it first.
2. If the CronJob issue is still present, depending on your application, remove or increase the .spec.startingDeadlineSeconds value.

KubeJobFailed¶

Related inhibited alert: KubePodsNotReady.

Root cause	At least one container of a Pod started by the Job exited with a non-zero status or was terminated by the Kubernetes or Linux system.
Investigation	See KubePodsCrashLooping.
Mitigation	Investigate and fix the root cause of missing Pod requirements, such as failing dependency application, Docker registry unavailability, unresponsive storage provided, and so on. Use the Mitigation section in KubePodsCrashLooping. Verify and resolve network-related alerts firing in the Kubernetes cluster.

See also

Kubernetes documentation: Jobs