OpenSearch cluster deadlock due to the corrupted index
Due to instability issues in a cluster, for example, after disaster recovery, networking issues, or low resources, some OpenSearch master pods may remain in the PostStartHookError state due to the corrupted .opendistro-ism-config index.
To verify that the cluster is affected:
The cluster is affected only when both conditions are met:
- One or two opensearch-master pods are stuck in the PostStartHookError state. The following example contains two failed pods:

  kubectl get pod -n stacklight | grep opensearch-master

  opensearch-master-0   1/1   Running              0                  41d
  opensearch-master-1   0/1   PostStartHookError   1659 (2m12s ago)   41d
  opensearch-master-2   0/1   PostStartHookError   1660 (6m6s ago)    41d

- In the logs of the opensearch container of the affected pods, the following WARN message is present:

  kubectl logs opensearch-master-1 -n stacklight -c opensearch

  ...
  [2024-06-05T08:30:26,241][WARN ][r.suppressed ] [opensearch-master-1] path: /_plugins/_ism/policies/audit_rollover_policy, params: {policyID=audit_rollover_policy, if_seq_no=30554, if_primary_term=3}
  org.opensearch.action.support.replication.ReplicationOperation$RetryOnPrimaryException: shard is not in primary mode
  ...

  The message itself can differ, but the following two parts of the message indicate that the cluster is affected:

  - The /_plugins/_ism prefix in the path
  - The shard is not in primary mode exception
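
As an optional shortcut that is not part of the original procedure, both indicators can be checked in a single pass. The pod name below is an example; replace it with an affected pod:

  # Print only the log lines that match either of the two indicators
  kubectl logs opensearch-master-1 -n stacklight -c opensearch | grep -E '_plugins/_ism|shard is not in primary mode'
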
To apply the issue resolution:
1. Decrease the number of replica shards from 1 to 0 for the .opendistro-ism-config internal index:

   1. Log in to the pod that is not affected by this issue, for example, opensearch-master-0:

      kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash

   2. Verify that the number of replicas of the .opendistro-ism-config index is "1":

      curl "http://localhost:9200/.opendistro-ism-config/_settings" | jq '.".opendistro-ism-config".settings.index.number_of_replicas'

      Example of system response:

      "1"

   3. Decrease the number of replicas from 1 to 0:

      curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 0 }'
   4. Verify that the number of replicas of the .opendistro-ism-config index is now "0".

   5. Wait around 30 minutes and verify whether the affected pods started normally or are still failing in the PostStartHookError loop:

      - If the pods started, increase the number of replicas for the .opendistro-ism-config index back to 1.
      - If the pods did not start, proceed to the following step.

      For reference commands, see the sketch below.
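
      A minimal sketch of these checks, reusing the commands shown earlier in this article; the call that restores the replica count mirrors the one that set it to 0. It assumes that the curl commands are run from inside the unaffected opensearch-master pod and the kubectl command from a host with access to the cluster:

      # Confirm that the number of replicas is now "0"
      curl "http://localhost:9200/.opendistro-ism-config/_settings" | jq '.".opendistro-ism-config".settings.index.number_of_replicas'

      # Check whether the affected pods left the PostStartHookError state
      kubectl get pod -n stacklight | grep opensearch-master

      # If the pods started, set the number of replicas back to 1
      curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 1 }'
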
2. Remove the internal .opendistro-ism-config index so that it is recreated:

   1. Remove the index:

      curl -X DELETE "http://localhost:9200/.opendistro-ism-config"
   2. Wait until all shards of this index are removed, which usually takes 10 to 15 seconds:

      curl localhost:9200/_cat/shards | grep opendistro-ism-config

      The system response must be empty.
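
      As an illustration only, not part of the original procedure, the check can be repeated automatically until the listing is empty:

      # Example only: poll every 5 seconds until no shards of the index remain
      until [ -z "$(curl -s localhost:9200/_cat/shards | grep opendistro-ism-config)" ]; do
        sleep 5
      done
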
      This internal index will be recreated on the next PostStartHook execution of any affected replica.

   3. Wait up to 30 minutes, assuming that during this time at least one attempt of PostStartHook execution occurs, and verify that the internal index was recreated:

      curl localhost:9200/_cat/shards | grep opendistro-ism-config
      The system response must contain two shards, for example:

      .opendistro-ism-config 0 p STARTED 10.233.118.238 opensearch-master-2
      .opendistro-ism-config 0 r STARTED 10.233.113.58  opensearch-master-1
   4. Wait up to 30 minutes and verify whether the affected pods started normally, as shown in the example below.
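
      The pod status can be verified with the same command as in the verification section above; the namespace and pod name prefix are taken from that example:

      # The affected pods must be Running and READY 1/1
      kubectl get pod -n stacklight | grep opensearch-master
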
Note: Before 2.27.0 (Cluster releases 17.2.0 and 16.2.0), verify that the cluster is not affected by the known issue 40020. If it is affected, proceed to the corresponding workaround.