OpenSearch cluster deadlock due to a corrupted index

Due to cluster instability, for example, after disaster recovery, networking issues, or resource shortages, some OpenSearch master pods may remain in the PostStartHookError state because of a corrupted .opendistro-ism-config index.

To verify that the cluster is affected:

The cluster is affected only when both of the following conditions are met:

  • One or two opensearch-master pods are stuck in the PostStartHookError state.

    The following example contains two failed pods:

    kubectl get pod -n stacklight | grep opensearch-master
    
    opensearch-master-0    1/1   Running              0                  41d
    opensearch-master-1    0/1   PostStartHookError   1659 (2m12s ago)   41d
    opensearch-master-2    0/1   PostStartHookError   1660 (6m6s ago)    41d
    
  • In the logs of the opensearch container of the affected pods, the following WARN message is present:

    kubectl logs opensearch-master-1 -n stacklight -c opensearch
    
    ...
    [2024-06-05T08:30:26,241][WARN ][r.suppressed             ] [opensearch-master-1] path: /_plugins/_ism/policies/audit_rollover_policy, params: {policyID=audit_rollover_policy, if_seq_no=30554, if_primary_term=3}
    org.opensearch.action.support.replication.ReplicationOperation$RetryOnPrimaryException: shard is not in primary mode
    ...
    

    The message itself can differ, but the following two parts of this message indicate that the cluster is affected:

    • The /_plugins/_ism prefix in the path

    • The shard is not in primary mode exception
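
    For example, you can filter the logs of an affected pod for both of these indicators at once. The pod name below is only an example, use the name of your affected pod:

    kubectl logs opensearch-master-1 -n stacklight -c opensearch | grep -E "_plugins/_ism|shard is not in primary mode"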

To apply the issue resolution:

  1. Decrease the number of replica shards from 1 to 0 for the .opendistro-ism-config internal index:

    1. Log in to the pod that is not affected by this issue, for example, opensearch-master-0:

      kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash
      
    2. Verify that the number of replicas of the .opendistro-ism-config index is "1":

      curl "http://localhost:9200/.opendistro-ism-config/_settings" | jq '.".opendistro-ism-config".settings.index.number_of_replicas'
      

      Example of system response:

      "1"
      
    3. Decrease the number of replicas from 1 to 0:

      curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 0 }'
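
      If the update succeeds, OpenSearch typically returns an acknowledgement similar to the following (exact output may vary):

      {"acknowledged":true}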
      
    4. Verify that the number of replicas of the .opendistro-ism-config index is "0".
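
      For example, re-run the same settings request as in the earlier verification step:

      curl "http://localhost:9200/.opendistro-ism-config/_settings" | jq '.".opendistro-ism-config".settings.index.number_of_replicas'

      Example of system response:

      "0"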

    5. Wait around 30 minutes and verify whether the affected pods started normally or are still failing in the PostStartHookError loop.
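
      To check the pod status, re-run the same command as during the initial verification:

      kubectl get pod -n stacklight | grep opensearch-master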

      • If the pods started, increase the number of replicas for the .opendistro-ism-config index back to 1.
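
        For example, use the same settings request as above with the replica count set back to 1:

        curl -X PUT -H 'Content-Type: application/json' "http://localhost:9200/.opendistro-ism-config/_settings" -d '{"index.number_of_replicas": 1 }'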

      • If the pods did not start, proceed to the following step.

  2. Remove the internal .opendistro-ism-config index so that it is recreated:

    1. Remove the index:

      curl -X DELETE "http://localhost:9200/.opendistro-ism-config"
      
    2. Wait until all shards of this index are removed, which usually takes 10-15 seconds:

      curl localhost:9200/_cat/shards | grep opendistro-ism-config
      

      The system response must be empty.

      This internal index will be recreated on the next PostStartHook execution of any affected replica.

    3. Wait up to 30 minutes, during which at least one PostStartHook execution attempt should occur, and verify that the internal index was recreated:

      curl localhost:9200/_cat/shards | grep opendistro-ism-config
      

      The system response must contain two shards, for example:

      .opendistro-ism-config    0 p STARTED    10.233.118.238 opensearch-master-2
      .opendistro-ism-config    0 r STARTED    10.233.113.58  opensearch-master-1
      
    4. Wait up to 30 minutes and verify whether the affected pods started normally.

    5. Before 2.27.0 (Cluster releases 17.2.0 and 16.2.0), verify that the cluster is not affected by issue 40020. If it is affected, apply the corresponding workaround.