Troubleshoot Kubernetes system alerts

This section describes the investigation and troubleshooting steps for the Kubernetes system alerts.


KubeNodeNotReady

Root cause

A node has entered the NotReady state and cannot run new Pods for one of the following reasons:

  • Issues with the kubelet or kube-proxy processes.

  • High resource consumption (insufficient disk space, memory, or CPU).
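
If you have SSH access to the affected node, you can check these processes directly. The following is a minimal sketch that assumes an MKE-based cluster where the kubelet and kube-proxy run as the ucp-kubelet and ucp-kube-proxy containers, consistent with the logger names used in the investigation steps below:

    # On the affected node: verify that the kubelet and kube-proxy
    # containers are up (the ucp-* container names are an assumption)
    docker ps --filter name=ucp-kubelet --filter name=ucp-kube-proxy

    # Inspect recent kubelet logs for errors
    docker logs --tail 100 ucp-kubelet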

Investigation

  1. In OpenSearch Dashboards, navigate to the Discover section.

  2. Expand the time range and filter the results by the ucp-kubelet or ucp-kube-proxy logger.

  3. Set the severity_label field matcher to ERROR. In the results, inspect the message field.

  4. Inspect the status of the KubeCPUOvercommitPods and KubeMemOvercommitPods alerts and verify whether the PIDPressure or DiskPressure node condition takes place (see also the sketch after this list):

    kubectl describe node <node_name>
    
  5. In the Kubernetes Cluster Grafana dashboard, verify the resource consumption over time.
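
The node pressure conditions can also be extracted directly from the Kubernetes API. The following is a minimal sketch, assuming that jq is installed; it prints the MemoryPressure, DiskPressure, and PIDPressure conditions for every node:

    # Print the resource pressure conditions for each node
    kubectl get nodes -o json | jq '.items[] |
    {node: .metadata.name, conditions: [.status.conditions[] |
    select(.type == "MemoryPressure" or .type == "DiskPressure" or .type == "PIDPressure") |
    {(.type): .status}]}'

A True status for any of these conditions indicates exhaustion of the corresponding resource on the node.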

Mitigation

Contact Mirantis support for a detailed procedure on dealing with each of the root causes.

KubeletTooManyPods

Root cause

The number of Pods running on a node has reached 90% of the node Pod capacity.

Investigation

  1. Verify the Pod capacity for nodes in your cluster:

    kubectl get node -o json | jq '.items[] |
    {node_name: .metadata.name, capacity: .status.capacity.pods}'
    
  2. Inspect the Non-terminated Pods section in the output of the following command (to compare the Pod count against the capacity across all nodes at once, see the sketch after this list):

    kubectl describe node <node_name>
    
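To compare the current Pod count against the Pod capacity for every node at once, you can combine the two commands above. The following is a minimal sketch, assuming that jq is installed:

    # Print "<node>: <pod count>/<pod capacity>" for each node
    # Note: the count includes all Pods on the node, also terminated ones
    for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
        count=$(kubectl get pods --all-namespaces -o json \
            --field-selector spec.nodeName=${node} | jq '.items | length')
        capacity=$(kubectl get node ${node} -o jsonpath='{.status.capacity.pods}')
        echo "${node}: ${count}/${capacity}"
    done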

Mitigation

  1. Verify the capacity of the nodes.

  2. Verify the Pod distribution across the nodes:

    kubectl get pods --all-namespaces -o json --field-selector \
    spec.nodeName=<node> | jq -r '.items | length'
    
  3. If the distribution is significantly uneven, investigate custom taints on the underloaded nodes (see the sketch after this list). If some of the custom taints block Pods from being scheduled, consider adding tolerations to the affected Pods or scaling the Container Cloud cluster out by adding worker nodes.

  4. If no custom taints exist, add worker nodes.

  5. Delete Pods that can be moved (preferably, Pods that belong to Deployments spanning multiple nodes).
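
To inspect custom taints across nodes (step 3), the following sketch, assuming that jq is installed, prints the taints configured on each node and shows how to remove a taint:

    # Print the taints configured on each node (null means no taints)
    kubectl get nodes -o json | jq '.items[] |
    {node: .metadata.name, taints: .spec.taints}'

    # Remove a blocking custom taint if it is no longer needed
    # (<node_name>, <key>, and <effect> are placeholders)
    kubectl taint nodes <node_name> <key>:<effect>-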