Helm releases get stuck in FAILED or UNKNOWN state during cluster deployment¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm v2 releases can get stuck in the FAILED
or UNKNOWN
state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready
and
providerStatus.helm.releaseStatuses.<releaseName>.success
are false
.
HelmBundle cannot recover from such states and requires manual actions.
The issue resolution below describes recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
To apply the issue resolution:
Verify that the failed release has the
UNKNOWN
orFAILED
status in the HelmBundle object:kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight} In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight: attempt: 2 chart: "" finishedAt: "2021-02-05T09:41:05Z" hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623 message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]' notes: "" status: UNKNOWN success: false version: 0.1.2-mcp-398
Log in to the affected
helm-controller-<podName>
Pod console:kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it <helmControllerPodName> -c tiller -- sh
To obtain the
helm-controller
Pod names:kubectl --kubeconfig <affectedClusterKubeconfigPath> get pods -n kube-system | grep helm-controller
Download the Helm v3 binary. For details, see official Helm documentation.
Remove the failed release:
./helm --host localhost:44134 delete <failed-release-name>
For example:
./helm --host localhost:44134 delete stacklight
Once done, the release triggers for redeployment.