OpenSearch cluster deadlock due to a corrupted index
Due to cluster instability, for example after disaster recovery, networking issues, or low resources, some OpenSearch master pods may remain in the PostStartHookError state because of a corrupted .opendistro-ism-config index.
To verify that the cluster is affected, confirm that both of the following conditions are met:
- One or two opensearch-master pods are stuck in the PostStartHookError state. The following example contains two failed pods:

    kubectl get pod -n stacklight | grep opensearch-master

    opensearch-master-0   1/1   Running              0                  41d
    opensearch-master-1   0/1   PostStartHookError   1659 (2m12s ago)   41d
    opensearch-master-2   0/1   PostStartHookError   1660 (6m6s ago)    41d
- In the logs of the opensearch container of the affected pods, the following WARN message is present:

    kubectl logs opensearch-master-1 -n stacklight -c opensearch

    ...
    [2024-06-05T08:30:26,241][WARN ][r.suppressed ] [opensearch-master-1] path: /_plugins/_ism/policies/audit_rollover_policy, params: {policyID=audit_rollover_policy, if_seq_no=30554, if_primary_term=3}
    org.opensearch.action.support.replication.ReplicationOperation$RetryOnPrimaryException: shard is not in primary mode
    ...
  The message itself can differ, but the following two parts of the message indicate that the cluster is affected:

  - The /_plugins/_ism prefix in the path
  - The shard is not in primary mode exception
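Optionally, you can scan the logs of all three opensearch-master pods for both indicators at once by combining the kubectl logs command above with grep. The loop below is only an illustrative shortcut; adjust the pod names to your cluster:

    for pod in opensearch-master-0 opensearch-master-1 opensearch-master-2; do
      echo "--- ${pod} ---"
      kubectl logs "${pod}" -n stacklight -c opensearch | grep -E '_plugins/_ism|shard is not in primary mode'
    done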
To apply the issue resolution:
1. Decrease the number of replica shards from 1 to 0 for the .opendistro-ism-config internal index:

   - Log in to the pod that is not affected by this issue, for example, opensearch-master-0:

       kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash

   - Verify that the number of replicas of the .opendistro-ism-config index is "1":

       curl "http://localhost:9200/.opendistro-ism-config/_settings" | jq '.".opendistro-ism-config".settings.index.number_of_replicas'

     Example of system response:

       "1"

   - Decrease the number of replicas from 1 to 0:

       curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 0 }'
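     A successful settings update typically returns an acknowledgement similar to the following:

       {"acknowledged":true}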
2. Verify that the number of replicas of the .opendistro-ism-config index is now "0" by rerunning the verification request from the previous step.

3. Wait around 30 minutes and verify whether the affected pods started normally or are still failing in the PostStartHookError loop:

   - If the pods started, increase the number of replicas for the .opendistro-ism-config index back to 1, for example, with the command shown below.
   - If the pods did not start, proceed to the following step.
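   To restore the replica count, you can reuse the settings request from step 1 with the replica value set back to 1, for example:

       curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 1 }'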
4. Remove the internal .opendistro-ism-config index so that it is recreated:

   - Remove the index:

       curl -X DELETE "http://localhost:9200/.opendistro-ism-config"

   - Wait until all shards of this index are removed, which usually takes 10-15 seconds:

       curl localhost:9200/_cat/shards | grep opendistro-ism-config

     The system response must be empty.

     This internal index will be recreated on the next PostStartHook execution of any affected replica.

   - Wait up to 30 minutes, assuming that during this time at least one PostStartHook execution attempt occurs, and verify that the internal index was recreated:

       curl localhost:9200/_cat/shards | grep opendistro-ism-config
     The system response must contain two shards in the output, for example:

       .opendistro-ism-config 0 p STARTED 10.233.118.238 opensearch-master-2
       .opendistro-ism-config 0 r STARTED 10.233.113.58  opensearch-master-1
5. Wait up to 30 minutes and verify whether the affected pods started normally.
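   For example, recheck the pod status with the same command as in the verification steps above; all opensearch-master pods should eventually report the Running state:

       kubectl get pod -n stacklight | grep opensearch-master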
6. For versions earlier than 2.27.0 (Cluster releases 17.2.0 and 16.2.0), verify that the cluster is not affected by the known issue 40020. If it is affected, apply the corresponding workaround.