Helm releases get stuck in FAILED or UNKNOWN state during cluster deployment


The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.

During a management, regional, or managed cluster deployment, Helm v2 releases can get stuck in the FAILED or UNKNOWN state even though the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false. HelmBundle cannot recover from such states on its own, so manual actions are required.
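
The two status fields mentioned above can be read from the command line. The following is a minimal sketch, not an official procedure: it assumes that the fields are located under .status of the Cluster object, that the object is reachable as the cluster resource with the kubeconfig you pass in, and that the release name contains no dots. The function names release_success_path and show_helm_status are hypothetical.

```shell
# Build the JSONPath for the success flag of one release.
# Assumption: providerStatus is located under .status of the Cluster object.
release_success_path() {
    printf '{.status.providerStatus.helm.releaseStatuses.%s.success}' "$1"
}

# Print helm.ready and the success flag of one release on a single line.
# Arguments: kubeconfig path, project namespace, cluster name, release name.
# Assumption: the resource name "cluster" resolves to the Cluster object.
show_helm_status() {
    kubectl --kubeconfig "$1" get cluster "$3" -n "$2" \
        -o jsonpath="ready={.status.providerStatus.helm.ready} success=$(release_success_path "$4")"
    echo
}
```

For example, show_helm_status <kubeconfigPath> <clusterProjectName> <clusterName> stacklight prints both values for the stacklight release.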

The issue resolution below describes recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

To apply the issue resolution:

  1. Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}

    In the command above and in the steps below, replace the parameters
    enclosed in angle brackets with the corresponding values of your cluster.

    Example of system response:

    attempt: 2
    chart: ""
    finishedAt: "2021-02-05T09:41:05Z"
    hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
    message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
      updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
      \"helmbundles.lcm.mirantis.com\" already exists"}]'
    notes: ""
    status: UNKNOWN
    success: false
    version: 0.1.2-mcp-398

  2. Log in to the tiller container of the affected helm-controller-<podName> Pod:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it <helmControllerPodName> -c tiller -- sh

    To obtain the helm-controller Pod names:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> get pods -n kube-system | grep helm-controller

  3. Download the Helm v2 binary. For details, see official Helm documentation.

  4. Remove the failed release:

    ./helm --host localhost:44134 delete <failed-release-name>

    For example:

    ./helm --host localhost:44134 delete stacklight

    Once done, redeployment of the release is triggered.
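
The status check from step 1 can be wrapped in a small helper that decides whether the manual recovery applies. The following is a minimal sketch under stated assumptions: the function names is_stuck_status, get_release_status, and report_release are hypothetical, and the JSONPath narrows the query from step 1 to the status field shown in the example system response. The sketch only reads the HelmBundle object and does not delete anything.

```shell
# Succeed (exit code 0) only for the two stuck states named in this procedure.
is_stuck_status() {
    case "$1" in
        FAILED|UNKNOWN) return 0 ;;
        *) return 1 ;;
    esac
}

# Read only the status field of one release from the HelmBundle object.
# Arguments: kubeconfig path, project namespace, cluster name, release name.
# Assumption: the release name contains no dots.
get_release_status() {
    kubectl --kubeconfig "$1" get helmbundle "$3" -n "$2" \
        -o jsonpath="{.status.releaseStatuses.$4.status}"
}

# Report whether the manual recovery described above applies to a release.
# Takes the same four arguments as get_release_status.
report_release() {
    status=$(get_release_status "$@") || return 2
    if is_stuck_status "$status"; then
        echo "$4: $status, manual recovery applies"
    else
        echo "$4: ${status:-<empty>}, no manual recovery needed"
    fi
}
```

For example, report_release <regionalClusterKubeconfigPath> <clusterProjectName> <clusterName> stacklight prints one line for the stacklight release. Running the same check again after step 4 is one way to confirm that the release left the stuck state.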