Troubleshoot Kubernetes resources alerts

This section describes the investigation and troubleshooting steps for the Kubernetes resources alerts.


KubeCPUOvercommitPods

Root cause

The sum of the Kubernetes Pods CPU requests exceeds either the cluster CPU capacity minus the capacity of one node or 80% of the total nodes CPU capacity, whichever is higher. This is a common issue in clusters with too many resources deployed.
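
To inspect the quantities behind this condition, you can compare the total CPU requests with the cluster allocatable CPU, for example, using the following query in the Prometheus web UI. This is a sketch for verification only: it assumes the kube-state-metrics v2 series names (kube_pod_container_resource_requests, kube_node_status_allocatable) and is not the exact alert expression.

    sum(kube_pod_container_resource_requests{resource="cpu"})
      - (sum(kube_node_status_allocatable{resource="cpu"})
      - max(kube_node_status_allocatable{resource="cpu"}))

A positive result means that the requests do not fit into the cluster capacity reduced by one node. Similarly, the ratio of the two sums shows the requested share of the total CPU capacity to compare against the 80% threshold.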

Investigation

Select one of the following options to verify nodes CPU requests:

  • Inspect the Allocated resources section in the output of the following command:

    kubectl describe nodes
    
  • Inspect the Cluster CPU Capacity panel of the Kubernetes Cluster Grafana dashboard.
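
If you use the first option, you can narrow the output of kubectl describe nodes to the relevant sections, for example:

    kubectl describe nodes | grep -A 8 "Allocated resources:"

The -A 8 value is an approximation of the section length; adjust it if the output appears truncated. The same sections also contain the memory requests, so this command applies to KubeMemOvercommitPods as well.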

Mitigation

Increase the node(s) CPU capacity or add worker node(s).

KubeMemOvercommitPods

Root cause

The sum of the Kubernetes Pods RAM requests exceeds either the cluster RAM capacity minus the capacity of one node or 80% of the total nodes RAM capacity, whichever is higher. This is a common issue in clusters with too many resources deployed.
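
As with the CPU variant of this alert, you can inspect the underlying quantities in the Prometheus web UI. The following query is a sketch that assumes the kube-state-metrics v2 series names and is not the exact alert expression; memory values are reported in bytes.

    sum(kube_pod_container_resource_requests{resource="memory"})
      - (sum(kube_node_status_allocatable{resource="memory"})
      - max(kube_node_status_allocatable{resource="memory"}))

A positive result means that the memory requests do not fit into the cluster capacity reduced by one node; the ratio of the two sums shows the requested share of the total RAM capacity for the 80% criterion.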

Investigation

Select one of the following options to verify nodes RAM requests:

  • Inspect the Allocated resources section in the output of the following command:

    kubectl describe nodes
    
  • Inspect the Cluster Mem Capacity panel of the Kubernetes Cluster Grafana dashboard.
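
In addition to the options above, you can list the allocatable memory per node, that is, the capacity side of the comparison, using custom columns, for example:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory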

Mitigation

Increase the node(s) RAM capacity or add worker node(s).

KubeContainersCPUThrottlingHigh

Root cause

The alert is based on the ratio of the container_cpu_cfs_throttled_periods_total metric to container_cpu_cfs_periods_total and indicates the percentage of CPU periods during which the container ran but was throttled (suspended for the remainder of the CPU period).
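
To view the current throttling ratio per container, you can run a query similar to the following one in the Prometheus web UI. It is a sketch based on the two metrics named above; the exact alert expression, time range, and threshold may differ.

    sum by (namespace, pod, container)
      (increase(container_cpu_cfs_throttled_periods_total{container!="", container!="POD"}[5m]))
    /
    sum by (namespace, pod, container)
      (increase(container_cpu_cfs_periods_total{container!="", container!="POD"}[5m]))

Values close to 1 indicate that the container is throttled during most of its CPU periods.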

Investigation

The alert typically fires upon a Pod start and for short periods of time. It may resolve automatically once the Pod CPU usage stabilizes. If the issue persists, determine the container that is causing it:

  1. List the affected Pod(s):

    kubectl get pods -n stacklight -o json | jq -r '.items[]
      | select(.metadata.ownerReferences[]?
      | select(.name=="<created_by_name>"))
      | .metadata.name'
    

    Substitute <created_by_name> with the value from the respective alert label.

  2. In the Prometheus web UI, run the following query to list the affected container(s), substituting <pod_name> with the Pod name obtained in the previous step:

    sum by (container)
      (rate(container_cpu_usage_seconds_total{pod="<pod_name>",
      container!="POD", container!=""}[3m]))
    
  3. For every affected container, compare the current CPU request and limit reported by Prometheus with the values defined in the Pod(s) configuration:

    kubectl describe <created_by_kind> <created_by_name> -n stacklight
    

    Substitute <created_by_kind> and <created_by_name> with the values from the respective alert labels.

    If some containers lack resources, increase the limits.
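
To obtain the current CPU request and limit values reported by Prometheus, as mentioned in the previous step, you can, for example, run the following queries separately, assuming the kube-state-metrics v2 series names:

    kube_pod_container_resource_requests{pod="<pod_name>", resource="cpu"}

    kube_pod_container_resource_limits{pod="<pod_name>", resource="cpu"}

Substitute <pod_name> with the Pod name obtained in step 1.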

Mitigation

Increase the Pod(s) limits for containers that lack resources.
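
For example, if the affected workload is a Deployment, you can raise the CPU limit of one of its containers as follows. The <deployment_name> and <container_name> placeholders correspond to the controller and container identified during the investigation, and 500m is an example value; select a limit based on the observed CPU usage. If the workload is managed by Helm or an operator, change the corresponding configuration values instead, otherwise the modification may be reverted.

    kubectl set resources deployment <deployment_name> -n stacklight -c <container_name> --limits=cpu=500m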