Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticsearchClusterHealthStatusMajor

Severity Major
Summary The Elasticsearch cluster status is YELLOW for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 2
Description Raises when the Elasticsearch cluster status is YELLOW for 2 minutes, meaning that Elasticsearch has allocated all of the primary shards but some or all of the replicas have not been allocated. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host and port are the elasticsearch:client:server:host and elasticsearch:client:server:port values defined in your Reclass model.
Troubleshooting
  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API.

  • If UNASSIGNED shards are present, reallocate the shards by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute. A minimal reroute request is sketched after this list.
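
The following is a minimal reroute sketch, assuming a single unassigned replica; the <index>, <shard_number>, and <node_name> placeholders must be taken from the _cat/shards output, and the allocate_replica command requires Elasticsearch 5.0 or later:

    curl -XPOST '<host>:<port>/_cluster/reroute?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"commands": [{"allocate_replica": {"index": "<index>", "shard": <shard_number>, "node": "<node_name>"}}]}'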

Tuning Not required

ElasticsearchClusterHealthStatusCritical

Severity Critical
Summary The Elasticsearch cluster status is RED for 2 minutes.
Raise condition elasticsearch_cluster_health_status == 3
Description Raises when the Elasticsearch cluster status is RED for 2 minutes, meaning that some or all of the primary shards are not ready. For the exact reason, inspect the Elasticsearch logs on the log nodes in /var/log/elasticsearch/elasticsearch.log. To verify the current status of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty', where host and port are the elasticsearch:client:server:host and elasticsearch:client:server:port values defined in your Reclass model.
Troubleshooting
  • Verify that the Elasticsearch service is running on all log nodes using service elasticsearch status.

  • Verify the status of the shards using curl -XGET '<host>:<port>/_cat/shards'. For details, see Cluster Allocation Explain API and the request sketch after this list.

  • Enable shard allocation by running the following command on the log nodes:

    curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
    
  • Manually reallocate the unassigned shards as described in Cluster Reroute.
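
As a hedged example, the Cluster Allocation Explain API can report why a shard remains unassigned. Without a request body, it explains the first unassigned shard that it finds:

    curl -XGET '<host>:<port>/_cluster/allocation/explain?pretty'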

For more troubleshooting details, see Official Elasticsearch documentation.

Tuning Not required

ElasticsearchServiceDown

Severity Minor
Summary The Elasticsearch service on the {{ $labels.host }} node is down.
Raise condition elasticsearch_up{host=~".*"} == 0
Description Raises when the Elasticsearch service is down on a log node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the status of the service by running systemctl status elasticsearch on the affected node, as in the sketch below.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
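
A minimal recovery sketch, assuming the elasticsearch systemd unit name used throughout this section:

    systemctl status elasticsearch
    # If the unit is inactive or failed, restart it and watch the log during startup:
    systemctl restart elasticsearch
    tail -f /var/log/elasticsearch/elasticsearch.log
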
Tuning Not required

ElasticsearchServiceDownMinor

Severity Minor
Summary 30% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.3
Description Raises when the Elasticsearch service is down on 30% or more of the log nodes. By default, 3 log nodes are present, meaning that the service is down on one node.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected node, or check all log nodes at once as sketched below.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
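
A hedged sketch for checking the service on all log nodes at once from the Salt Master node, assuming the log nodes are matched by the I@elasticsearch:server pillar target (adjust the target to your model):

    # Returns True or False per minion for the elasticsearch service.
    salt -C 'I@elasticsearch:server' service.status elasticsearch
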
Tuning Not required

ElasticsearchServiceDownMajor

Severity Major
Summary 60% of Elasticsearch services are down for 2 minutes.
Raise condition count(elasticsearch_up{host=~".*"} == 0) >= count(elasticsearch_up{host=~".*"}) * 0.6
Description Raises when the Elasticsearch service is down on 60% or more of the log nodes. By default, 3 log nodes are present, meaning that the service is down on two nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on the affected nodes.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required

ElasticsearchServiceOutage

Severity Critical
Summary All Elasticsearch services within the cluster are down.
Raise condition count(elasticsearch_up{host=~".*"} == 0) == count(elasticsearch_up{host=~".*"})
Description Raises when the Elasticsearch service is down on all log nodes.
Troubleshooting
  • Inspect the ElasticsearchServiceDown alerts for the host names of the affected nodes.
  • Verify the Elasticsearch status by running the systemctl status elasticsearch command on each log node.
  • Inspect the Elasticsearch logs in /var/log/elasticsearch/elasticsearch.log for the exact reason.
Tuning Not required

ElasticsearchDiskWaterMarkMinor

Severity Minor
Summary The Elasticsearch {{ $labels.instance }} instance uses 60% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.6
Description Raises when the Elasticsearch instance uses 60% or more of the disk space on a log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition. Per-node and per-index usage can be inspected as sketched after this list.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
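
A hedged sketch for identifying what consumes the space, using the Elasticsearch _cat APIs with the host and port defined in your model:

    # Disk usage per Elasticsearch node:
    curl -XGET '<host>:<port>/_cat/allocation?v'
    # Indices sorted by size, largest first (the s= parameter requires Elasticsearch 5.1 or later):
    curl -XGET '<host>:<port>/_cat/indices?v&s=store.size:desc'
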
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus web UI, use the raise condition query to inspect the metric over a longer period of time and determine the best threshold.

For example, change the threshold to 80%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMinor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance) -
                max(elasticsearch_fs_total_available_in_bytes) by (host, instance)) /
                max(elasticsearch_fs_total_total_in_bytes) by (host, instance) >= 0.8
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI, or from the command line as sketched below.
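
    For example, assuming your Prometheus version exposes the /api/v1/rules endpoint (available since Prometheus 2.2) and using placeholder host and port values:

      curl -s 'http://<prometheus_host>:<prometheus_port>/api/v1/rules' | grep ElasticsearchDiskWaterMarkMinor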

ElasticsearchDiskWaterMarkMajor

Severity Major
Summary The Elasticsearch {{ $labels.instance }} instance uses 75% of disk space on the {{ $labels.host }} node for 5 minutes.
Raise condition (max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.75
Description Raises when the Elasticsearch instance uses 75% or more of the disk space on a log node for 5 minutes. To verify the available and used disk space, run df -h.
Troubleshooting
  • Free or extend the disk space on the Elasticsearch partition.
  • Decrease the default retention period for Elasticsearch as described in Configure Elasticsearch Curator.
Tuning

Typically, you should not change the default value. If the alert is constantly firing, verify the available disk space on the log nodes and adjust the threshold according to the available space. Additionally, in the Prometheus web UI, use the raise condition query to inspect the metric over a longer period of time and determine the best threshold.

For example, change the threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file already exists.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            ElasticsearchDiskWaterMarkMajor:
              if: >-
                (max(elasticsearch_fs_total_total_in_bytes) by (host, instance) -
                max(elasticsearch_fs_total_available_in_bytes) by (host, instance)) /
                max(elasticsearch_fs_total_total_in_bytes) by (host, instance) >= 0.9
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

ElasticsearchExporterNoDailyLogs

Available since 2019.2.10

Severity Warning
Summary No new logs sent from a node within the last 3 hours.
Raise condition (sum by (host) (changes(logs_program_host_doc_count[3h])) or sum by (host) (up{host!=""})*0) == 0
Description Raises when no new logs were shipped from the {{ $labels.host }} node within the last 3 hours.
Troubleshooting Verify that Fluentd is operating properly on the affected node.
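
A hedged check sketch, assuming Fluentd is deployed as the td-agent service with its default log path (both may differ in your deployment):

    systemctl status td-agent
    tail -n 100 /var/log/td-agent/td-agent.log
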
Tuning Not required