Troubleshoot cAdvisor alerts

This section describes the investigation and troubleshooting steps for the cAdvisor service.


Root cause

The alert is based on the ratio of the container_cpu_cfs_throttled_periods_total metric to container_cpu_cfs_periods_total. It represents the percentage of CPU CFS periods in which the container ran but was throttled, that is, was stopped from running for the remainder of the period.
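For illustration, the ratio behind the alert can be sketched as a simple calculation. The function and counter values below are hypothetical, not part of the alert definition; they only show how the two counters combine into a throttling percentage:

```python
def throttled_percentage(throttled_periods: float, total_periods: float) -> float:
    """Percentage of CFS periods in which the container was throttled.

    throttled_periods: increase of container_cpu_cfs_throttled_periods_total
    total_periods:     increase of container_cpu_cfs_periods_total
    over the same time window.
    """
    if total_periods == 0:
        return 0.0
    return 100.0 * throttled_periods / total_periods

# Hypothetical values: over the window, the container ran in 200 CFS
# periods and was throttled in 50 of them.
print(throttled_percentage(50, 200))  # 25.0
```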


The alert usually fires when a Pod starts, often during brief intervals, and may resolve automatically once the Pod CPU usage stabilizes. If the issue persists:

  1. Obtain the created_by_name label from the alert.

  2. List the affected Pods using the created_by_name label:

    kubectl get pods -n stacklight -o json | jq -r '.items[] | \
    select(.metadata.ownerReferences[]?.name=="<created_by_name>") | .metadata.name'

    In the system response, obtain one or more affected Pod names.

  3. List the affected containers. Using the <pod_name> obtained in the previous step, run the following query in the Prometheus query window:

    sum by (container) (rate(container_cpu_usage_seconds_total{pod="<pod_name>", container!="POD", container!=""}[3m]))

  4. For every container, compare the CPU usage obtained from Prometheus with the requests and limits defined in the Pod configuration:

    kubectl describe <created_by_kind> <created_by_name>

    In the command above, replace <created_by_kind> and <created_by_name> with the corresponding alert values.

    If some containers lack CPU resources, increase their limits.
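The comparison in the last step can be sketched in code. The helper names, usage numbers, and limits below are made up for illustration; the sketch flags containers whose observed CPU usage approaches their configured limit:

```python
def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m' or '2') to cores."""
    if value.endswith("m"):
        return float(value[:-1]) / 1000.0
    return float(value)

def containers_near_limit(usage_cores: dict, limits: dict, threshold: float = 0.8) -> list:
    """Return names of containers whose usage exceeds threshold * limit."""
    flagged = []
    for name, usage in usage_cores.items():
        limit = parse_cpu(limits[name])
        if limit and usage >= threshold * limit:
            flagged.append(name)
    return flagged

# Hypothetical per-container usage in cores (from the Prometheus query)
usage = {"app": 0.45, "sidecar": 0.05}
# Hypothetical limits from the Pod spec
limits = {"app": "500m", "sidecar": "200m"}
print(containers_near_limit(usage, limits))  # ['app']
```

Here the "app" container is using 0.45 cores against a 0.5-core limit, so it is a likely candidate for throttling.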


As a possible solution, increase the Pod CPU limits.
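For example, the CPU limit can be raised in the container spec of the owning workload. The values below are illustrative only; choose them based on the usage observed in Prometheus:

```yaml
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1"
```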