The ‘database space exceeded’ error on large clusters

Occasionally, a cluster upgrade may get stuck on large clusters running 500+ nodes and 15k+ pods because the etcd database exceeds its space limit. The following error then occurs each time you access the Kubernetes API server:

etcdserver: mvcc: database space exceeded

Normally, kube-apiserver actively compacts the etcd database. In rare cases, for example, during rapid creation of numerous Kubernetes objects, you must compact the etcd database manually as described below. Once done, Mirantis recommends that you identify the root cause of the issue and clean up unnecessary resources so that manual etcd compaction and defragmentation are not required in the future.
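
Before compacting, it can help to confirm that etcd itself reports the space condition. The following is a minimal sketch rather than a Mirantis-provided procedure. It assumes that etcdctl (v3 API) is available on the controller node and already configured with the cluster endpoints and TLS certificates, which on an MKE cluster may mean running it inside the etcd container. The function names `has_nospace_alarm` and `check_etcd_space` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes etcdctl (v3 API) is installed and already configured
# with the cluster endpoints and TLS certificates (an assumption, not part
# of the documented procedure).

# Hypothetical helper: succeeds if the text on stdin reports a NOSPACE alarm.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Hypothetical helper: prints endpoint status and reports the alarm state.
check_etcd_space() {
  # The table output includes a DB SIZE column for each endpoint.
  etcdctl endpoint status --write-out=table
  if etcdctl alarm list | has_nospace_alarm; then
    echo "NOSPACE alarm is raised: etcd has exceeded its space quota"
  else
    echo "No NOSPACE alarm reported"
  fi
}
```

After sourcing the file, call `check_etcd_space` on the controller node. A raised NOSPACE alarm is consistent with the `database space exceeded` error shown above.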

To apply the issue resolution:

  1. Open an SSH connection to any controller node.

  2. Run the following scripts to compact and defragment the etcd database:

    sudo -i
    compact_etcd.sh
    defrag_etcd.sh
    

Defragment the etcd database as described in the MKE documentation: Apply etcd defragmentation.
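
For orientation, the compaction and defragmentation steps above correspond roughly to the generic upstream etcd maintenance sequence sketched below. This is not the content of `compact_etcd.sh` or `defrag_etcd.sh`, which is not shown in this document. The sketch assumes that etcdctl (v3 API) is already configured with the endpoints and certificates of the cluster, and the function names `extract_revision` and `manual_compact_defrag` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: generic etcd maintenance sequence, NOT the provided scripts.
# Assumes etcdctl (v3 API) is configured with endpoints and TLS certificates.

# Hypothetical helper: reads `etcdctl endpoint status --write-out=json`
# output on stdin and prints the first revision number found.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Hypothetical helper: compact history, defragment, then clear alarms.
manual_compact_defrag() {
  local rev
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  if [ -z "$rev" ]; then
    echo "Could not determine the current etcd revision" >&2
    return 1
  fi
  # Discard key history older than the current revision.
  etcdctl compact "$rev" || return 1
  # Release the space freed by compaction back to the file system.
  etcdctl defrag || return 1
  # Clear the NOSPACE alarm so that etcd accepts writes again.
  etcdctl alarm disarm
}
```

The order matters: compaction only marks old revisions as removable, and defragmentation is what actually shrinks the database file. The alarm is disarmed last, once space has been reclaimed.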