Troubleshoot Kubernetes storage alerts

This section describes the investigation and troubleshooting steps for the Kubernetes storage alerts.


KubePersistentVolumeUsageCritical

Related inhibited alert: KubePersistentVolumeFullInFourDays.

Root cause

A persistent volume (PV) has less than 3% of free space. Applications that rely on writing to the disk will crash once no space is available.

Investigation and mitigation

Refer to KubePersistentVolumeFullInFourDays.

KubePersistentVolumeFullInFourDays

Root cause

The PV has less than 15% of total space available. Based on the predict_linear() Prometheus function, it is expected to fill up in four days.
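
To check this projection manually, you can query Prometheus directly. The following is a minimal sketch: it assumes that the Prometheus API is reachable on localhost:9090, for example, through a port-forward, and uses an illustrative 6-hour sample range. The exact alert expression and thresholds may differ.

  # Placeholders: adjust the namespace and service name to your monitoring stack.
  kubectl port-forward -n <monitoring_namespace> svc/<prometheus_service> 9090:9090 &

  # Predict the available bytes on the volume 4 days (345600 seconds) from now,
  # based on the last 6 hours of kubelet volume statistics.
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim="<persistentvolumeclaim>"}[6h], 345600)'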

Investigation

  1. Verify the current PV size:

    kubectl get pv <pv_name> -o=jsonpath='{.spec.capacity.storage}'
    
  2. Verify the configured application retention period.

  3. Optional. Review the data stored on the PV, including the application data, logs, and so on, to verify the space consumption and eliminate potential overuse:

    1. Obtain the name of the Pod that uses the PV:

      kubectl get pods -n <namespace> -o json | jq -r '.items[] |
      select(.spec.volumes[] |
      select(.persistentVolumeClaim.claimName=="<persistentvolumeclaim>")) |
      .metadata.name'
      

      Substitute <persistentvolumeclaim> with the value from the alert persistentvolumeclaim label.

    2. Obtain the name of the container that has the volume mounted:

      kubectl describe pod -n <namespace> <pod_name>
      
    3. Open a shell in the Pod container and determine which files consume the most space:

      kubectl exec -it -n <namespace> <pod_name> -c <container_name> -- /bin/bash
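
      For example, a command similar to the following lists the largest files and directories on the volume. The <mount_path> placeholder stands for the mountPath of the volume in the container specification; note that minimal container images may not include du or a sort that supports the -h option.

      # Replace <mount_path> with the volume mount path from the container spec.
      du -ah <mount_path> 2>/dev/null | sort -rh | head -n 20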
      

Mitigation

Select from the following options:

  • Decrease the application retention time, if applicable.

  • Resize the PV, if possible, or create a new PV, migrate the data, and switch the volumes using a rolling update.
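
    If the storage class of the volume has allowVolumeExpansion set to true, you can usually grow the volume by patching the PVC request, as in the following sketch. The storage class, PVC name, namespace, and target size are placeholders; some storage drivers additionally require a Pod restart or a file system expansion step.

    # Verify that the storage class allows volume expansion.
    kubectl get sc <sc_name> -o=jsonpath='{.allowVolumeExpansion}'

    # Request a larger size for the PVC, for example, 20Gi.
    kubectl patch pvc <pvc_name> -n <namespace> \
      -p '{"spec":{"resources":{"requests":{"storage":"<new_size>"}}}}'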

KubePersistentVolumeErrors

Root cause

Some PVs are in the Failed or Pending state.

Investigation

  1. Verify the status of the PVs:

    kubectl get pv -o json | jq -r '.items[] | select(.status.phase=="Pending" or .status.phase=="Failed") | .metadata.name'
    
  2. For each PV in the Failed or Pending state, inspect its details:

    kubectl describe pv <pv_name>
    

    Inspect Kubernetes events, if available. Otherwise:

    1. In the Discover section of the OpenSearch Dashboards web UI, change the index pattern to kubernetes_events-*.

    2. Expand the time range and filter the results by kubernetes.event.involved_object.name, which equals the <pv_name> from the previous step. In the matched results, find the kubernetes.event.message field.

  3. If the PV is in the Pending state, it is waiting to be provisioned. Verify the PV storage class name:

    kubectl get pv <pv_name> -o=json | jq -r '.spec.storageClassName'
    
  4. Verify the provisioner name specified for the storage class:

    kubectl get sc <sc_name> -o=json | jq -r '.provisioner'
    
  5. If the provisioner is deployed as a workload in the affected Kubernetes cluster, verify whether it experiences availability or health issues, as shown in the following example. Further investigation and mitigation depend on the provisioner. The Failed state can be caused by a custom recycler error when the deprecated Recycle reclaim policy is used.
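
    For example, if the provisioner runs as Pods in the cluster, you can verify their health and recent logs. The namespace and label selector below are placeholders that depend on the provisioner in use:

    # Placeholders: adjust the namespace and label selector to your provisioner.
    kubectl get pods -n <provisioner_namespace> -l <provisioner_label_selector>
    kubectl logs -n <provisioner_namespace> -l <provisioner_label_selector> --tail=100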

Mitigation

  • Fix the PV in the Pending state according to the investigation outcome.

    Warning

    Deleting a PV causes data loss. Deleting a PVC causes deletion of the bound PV if the PV has the Delete reclaim policy.

  • Fix the PV in the Failed state:

    1. Investigate the recycler Pod by verifying the kube-controller-manager configuration. Search for the PV name in the kube-controller-manager logs, as shown in the example after this list.

    2. Delete the Pod and the mounted PVC if it is still in the Terminating state.
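
    For example, to search for the PV name in the kube-controller-manager logs, use a command similar to the following. The kube-system namespace and the component label are typical for clusters that run kube-controller-manager as a static Pod and may differ in your environment:

    # Placeholders: <pv_name> is the affected PV; the label selector assumes a
    # kube-controller-manager Pod labeled by kubeadm in the kube-system namespace.
    kubectl logs -n kube-system -l component=kube-controller-manager --tail=-1 | grep <pv_name>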