Helm releases get stuck in FAILED or UNKNOWN state during cluster deployment

During a management, regional, or managed cluster deployment, Helm releases can get stuck in the FAILED or UNKNOWN state although the corresponding machines statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false. HelmBundle cannot recover from such states and requires manual actions.

The issue resolution below describes recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

To apply the issue resolution:

  1. Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
    
    In the command above and in the steps below, replace the parameters
    enclosed in angle brackets with the corresponding values of your cluster.
    

    Example of system response:

    stacklight:
    attempt: 2
    chart: ""
    finishedAt: "2021-02-05T09:41:05Z"
    hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
    message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
      updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
      \"helmbundles.lcm.mirantis.com\" already exists"}]'
    notes: ""
    status: UNKNOWN
    success: false
    version: 0.1.2-mcp-398
    
  2. Log in to the helm-controller pod console:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
    
  3. Download the Helm v3 binary. For details, see official Helm documentation.

  4. Remove the failed release:

    helm delete <failed-release-name>
    

    For example:

    helm delete stacklight
    

    Once done, the release triggers for redeployment.