Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticsearchClusterHealthStatusMajor

Severity Major
Summary The Elasticsearch cluster status is YELLOW for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 2
Description Raises when the Elasticsearch cluster status is YELLOW for 2 minutes, meaning that Elasticsearch has allocated all of the primary shards but some or all of the replicas have not been allocated. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host and port are the elasticsearch:client:server:host and elasticsearch:client:server:port values defined in your Reclass model.
Troubleshooting
  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API.

  • If UNASSIGNED shards are present, reallocate the shards by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute. A minimal reroute request is sketched after this list.
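
The following is a minimal reroute sketch, assuming a single unassigned replica; the <index>, <shard_number>, and <node_name> placeholders must be taken from the _cat/shards output, and the allocate_replica command requires Elasticsearch 5.0 or later:

    curl -XPOST '<host>:<port>/_cluster/reroute?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"commands": [{"allocate_replica": {"index": "<index>", "shard": <shard_number>, "node": "<node_name>"}}]}'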

Tuning Not required

ElasticsearchClusterHealthStatusCritical

Severity Critical
Summary The Elasticsearch cluster status is RED for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 3
Description Raises when the Elasticsearch cluster status is RED for 2 minutes, meaning that some or all of the primary shards are not ready. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host and port are the elasticsearch:client:server:host and elasticsearch:client:server:port values defined in your Reclass model.
Troubleshooting
  • Verify that the Elasticsearch service is running on all log nodes using service elasticsearch status.

  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API and the request sketch after this list.

  • Enable shard allocation by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute.
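
As a hedged example, the Cluster Allocation Explain API can report why a shard remains unassigned. Without a request body, it explains the first unassigned shard that it finds:

    curl -XGET '<host>:<port>/_cluster/allocation/explain?pretty'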

For more troubleshooting details, see Official Elasticsearch documentation.

Tuning Not required

ElasticsearchServiceDown

Severity Minor
Summary The Elasticsearch service on the {{ $labels.host }} node is down.
Raise condition elasticsearch_up{host=~".*"} == 0
Description Raises when the Elasticsearch service is down on a log node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the status of the service by running systemctl status elasticsearch on the affected node, as in the sketch below.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
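
A minimal recovery sketch, assuming the elasticsearch systemd unit name used throughout this section:

    systemctl status elasticsearch
    # If the unit is inactive or failed, restart it and watch the log during startup:
    systemctl restart elasticsearch
    tail -f /var/log/elasticsearch/elasticsearch.log
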
Tuning Not required

ElasticsearchServiceDownMinor

Severity Minor
Summary 30% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.3
Description Raises when the Elasticsearch service is down on 30% or more of the log nodes. By default, 3 log nodes are present, meaning that the service is down on one node.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected node, or check all log nodes at once as sketched below.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
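
A hedged sketch for checking the service on all log nodes at once from the Salt Master node, assuming the log nodes are matched by the I@elasticsearch:server pillar target (adjust the target to your model):

    # Returns True or False per minion for the elasticsearch service.
    salt -C 'I@elasticsearch:server' service.status elasticsearch
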
Tuning Not required

ElasticsearchServiceDownMajor

Severity Major
Summary 60% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.6
Description Raises when the Elasticsearch service is down on 60% or more of the log nodes. By default, 3 log nodes are present, meaning that the service is down on two nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected nodes.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required

ElasticsearchServiceOutage

Severity Critical
Summary All Elasticsearch services within the cluster are down.
Raise condition count(elasticsearch_up{host=~".*"} == 0) == count(elasticsearch_up{host=~".*"})
Description Raises when the Elasticsearch service is down on all log nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on each log node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required

ElasticsearchDiskWaterMarkMinor

Severity Minor
Summary The Elasticsearch {{ $labels.instance }} instance uses 60% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.6
Description Raises when the Elasticsearch instance uses 60% or more of the disk space on a log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition. Per-node and per-index usage can be inspected as sketched after this list.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
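
A hedged sketch for identifying what consumes the space, using the Elasticsearch _cat APIs with the host and port defined in your model:

    # Disk usage per Elasticsearch node:
    curl -XGET '<host>:<port>/_cat/allocation?v'
    # Indices sorted by size, largest first (the s= parameter requires Elasticsearch 5.1 or later):
    curl -XGET '<host>:<port>/_cat/indices?v&s=store.size:desc'
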
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus web UI, use the raise condition query to inspect the metric over a longer period of time and determine the best threshold.

For example, change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMinor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance) -
                max(elasticsearch_fs_total_available_in_bytes) by (host, instance)) /
                max(elasticsearch_fs_total_total_in_bytes) by (host, instance) >= 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI, or from the command line as sketched below.
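
    For example, assuming your Prometheus version exposes the /api/v1/rules endpoint (available since Prometheus 2.2) and using placeholder host and port values:

      curl -s 'http://<prometheus_host>:<prometheus_port>/api/v1/rules' | grep ElasticsearchDiskWaterMarkMinor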

ElasticsearchDiskWaterMarkMajor

Severity Major
Summary The Elasticsearch {{ $labels.instance }} instance uses 75% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.75
Description Raises when the Elasticsearch instance uses 75% or more of the disk space on a log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus web UI, use the raise condition query to inspect the metric over a longer period of time and determine the best threshold.

For example, change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMajor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance) -
                max(elasticsearch_fs_total_available_in_bytes) by (host, instance)) /
                max(elasticsearch_fs_total_total_in_bytes) by (host, instance) >= 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ElasticsearchExporterNoDailyLogs

Available since 2019.2.10

Severity Warning
Summary No new logs sent from a node within the last 3 hours.
Raise condition (sum by (host) (changes(logs_program_host_doc_count[3h])) or sum by (host) (up{host!=""})*0) == 0
Description Raises when no new logs were shipped from the {{ $labels.host }} node within the last 3 hours.
Troubleshooting Verify that Fluentd is operating properly on the affected node.
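
A hedged check sketch, assuming Fluentd is deployed as the td-agent service with its default log path (both may differ in your deployment):

    systemctl status td-agent
    tail -n 100 /var/log/td-agent/td-agent.log
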
Tuning Not required