Troubleshoot Kubernetes system alerts

This section describes the investigation and troubleshooting steps for the Kubernetes system alerts.


KubeNodeNotReady

Root cause

A node has entered the NotReady state and cannot run new Pods due to one of the following reasons:

  • Issues with the kubelet or kube-proxy processes.

  • High resource consumption (insufficient disk space, memory, or CPU).

Investigation

  1. In OpenSearch Dashboards, navigate to the Discover section.

  2. Expand the time range and filter the results by the ucp-kubelet or ucp-kube-proxy logger.

  3. Set the severity_label field matcher to ERROR. In the results, inspect the message field.

  4. Inspect the status of the KubeCPUOvercommitPods and KubeMemOvercommitPods alerts. To verify whether PIDPressure or DiskPressure takes place, inspect the node conditions in the output of the following command (see also the example after this list):

    kubectl describe node <node_name>
    
  5. In the Kubernetes Cluster Grafana dashboard, verify the resource consumption over time.
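
As a quick complement to the steps above, you can list the node conditions directly to check whether PIDPressure, DiskPressure, or MemoryPressure is reported. This is a minimal sketch that assumes kubectl access to the affected cluster:

    # List the node conditions in a compact form; a healthy node reports Ready=True
    # and all pressure conditions as False:
    kubectl get node <node_name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'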

Mitigation

Contact Mirantis support for a detailed procedure on dealing with each of the root causes.

KubeletTooManyPods

Root cause

The number of Pods has reached 90% of the Kubernetes node capacity.

Investigation

  1. Verify the Pod capacity for nodes in your cluster:

    kubectl get node -o json | \
    jq '.items[] | {node_name: .metadata.name, capacity: .status.capacity.pods}'
    
  2. Inspect the Non-terminated Pods section in the output of the following command:

    kubectl describe node <node_name>
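
As a complement to reading the Non-terminated Pods section, you can count the non-terminated Pods on a node and compare the result against the capacity obtained in step 1. This is a sketch that assumes kubectl access to the affected cluster:

    # Count non-terminated Pods on a node; Succeeded and Failed Pods do not count
    # against the node capacity:
    kubectl get pods --all-namespaces --no-headers --field-selector \
    spec.nodeName=<node_name>,status.phase!=Succeeded,status.phase!=Failed | wc -l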
    

Mitigation

  1. Verify the capacity of the nodes.

  2. Verify the Pods distribution:

    kubectl get pods --all-namespaces -o json --field-selector \
    spec.nodeName=<node> | jq -r '.items | length'
    
  3. If the distribution is extremely uneven, investigate custom taints on the underloaded nodes (see the example after this list). If some of the custom taints block Pods from being scheduled, consider adding tolerations or scaling the Container Cloud cluster out by adding worker nodes.

  4. If no custom taints exist, add worker nodes.

  5. Delete Pods that can be moved to other nodes (preferably, Pods of multi-node Deployments).
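
The following sketch shows one way to inspect custom taints on an underloaded node, as referenced in step 3; it assumes kubectl access to the affected cluster:

    # Print all taints configured on a node (empty output means no taints are set):
    kubectl get node <node_name> -o jsonpath='{.spec.taints}'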

KubeStateMetricsTargetDown

Root cause

Prometheus scraping of the kube-state-metrics service is unreliable, resulting in a success rate below 90%. It indicates either a failure of the kube-state-metrics Pod or (in rare scenarios) network issues causing scrape requests to time out.

Related alert: KubeDeploymentOutage{deployment=prometheus-kube-state-metrics} (inhibiting).

Investigation

In the Prometheus web UI, search for firing alerts that relate to networking issues in the Container Cloud cluster and fix them.

If the cluster network is healthy, refer to the Investigation section of the KubePodsCrashLooping alert troubleshooting description to collect information about kube-state-metrics Pods.
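
If the target itself appears to be the problem, a quick status check of the kube-state-metrics Pod and its Service endpoints can help confirm it. This sketch assumes that StackLight runs in the stacklight namespace, which may differ in your deployment:

    # Verify that the kube-state-metrics Pod is running and not restarting:
    kubectl get pods -n stacklight -o wide | grep kube-state-metrics

    # Verify that the kube-state-metrics Service has endpoints available for scraping:
    kubectl get endpoints -n stacklight | grep kube-state-metrics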

Mitigation

Based on the investigation results, select from the following options:

  • Fix the networking issues

  • Apply solutions from the Mitigation section of the KubePodsCrashLooping alert troubleshooting description

If the issue persists, collect the investigation output and contact Mirantis support for further assistance.

KubernetesMasterAPITargetsOutage

Root cause

The Prometheus Blackbox Exporter target that probes the /healthz endpoints of the Kubernetes API server nodes is not reliably available, and Prometheus metric scrapes fail. It indicates either a failure of the prometheus-blackbox-exporter Pod or (in rare cases) network issues causing scrape requests to time out.

Related alert: KubeDeploymentOutage{deployment=prometheus-blackbox-exporter} (inhibiting).
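
To distinguish an exporter failure from an actual API server outage, you can query the health endpoint mentioned above directly through your kubectl credentials. This is a minimal sketch that assumes working kubectl access to the affected cluster; note that the request goes through the API endpoint configured in your kubeconfig, which may be a load balancer rather than an individual master node:

    # Query the Kubernetes API server health endpoint; a healthy server returns "ok":
    kubectl get --raw /healthz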

Investigation

In the Prometheus web UI, search for firing alerts that relate to networking issues in the Container Cloud cluster and fix them.

If the cluster network is healthy, refer to the Investigation section of the KubePodsCrashLooping alert troubleshooting description to collect information about prometheus-blackbox-exporter Pods.
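
If the exporter itself appears to be the problem, a quick check of the prometheus-blackbox-exporter Pod status can help confirm it. This sketch assumes that StackLight runs in the stacklight namespace, which may differ in your deployment:

    # Verify that the prometheus-blackbox-exporter Pod is running and not restarting:
    kubectl get pods -n stacklight -o wide | grep blackbox-exporter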

Mitigation

Based on the investigation results, select from the following options:

  • Fix the networking issues

  • Apply solutions from the Mitigation section of the KubePodsCrashLooping alert troubleshooting description

If the issue persists, collect the investigation output and contact Mirantis support for further assistance.