Elasticsearch
This section describes the alerts for the Elasticsearch service.
ElasticsearchClusterHealthStatusMajor
Severity | Major |
Summary | The Elasticsearch cluster status is YELLOW for 2 minutes. |
Raise condition | elasticsearch_cluster_health_status == 2 |
Description |
Raises when the Elasticsearch cluster status is YELLOW for 2 minutes,
meaning that Elasticsearch has allocated all of the primary shards but
some or all of the replicas have not been allocated. For the exact
reason, inspect the Elasticsearch logs on the log nodes in
/var/log/elasticsearch/elasticsearch.log. To verify the current status
of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty',
where host is elasticsearch:client:server:host and port is
elasticsearch:client:server:port defined in your model. |
Troubleshooting |
- Verify the status of the shards using
curl -XGET '<host>:<port>/_cat/shards'. For details, see
Cluster Allocation Explain API.
- If UNASSIGNED shards are present, enable shard allocation by running
the following command on the log nodes:
curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
- Manually reallocate the unassigned shards as described in
Cluster Reroute.
An example of resolving the <host> and <port> placeholders follows
this alert.
|
Tuning | Not required |
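The commands in this alert use the <host> and <port> placeholders. The
following is a minimal sketch of resolving them from the Reclass model
and checking the cluster in one pass; the I@elasticsearch:client target
and the example endpoint 172.16.10.101:9200 are illustrative
assumptions, so substitute the values of your deployment:

# Resolve the Elasticsearch client endpoint defined in the model.
salt -C 'I@elasticsearch:client' pillar.get elasticsearch:client:server:host
salt -C 'I@elasticsearch:client' pillar.get elasticsearch:client:server:port

# Check the cluster health and list unassigned shards using the
# resolved endpoint (example value shown).
curl -XGET 'http://172.16.10.101:9200/_cat/health?pretty'
curl -XGET 'http://172.16.10.101:9200/_cat/shards' | grep UNASSIGNED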
ElasticsearchClusterHealthStatusCritical
Severity | Critical |
Summary | The Elasticsearch cluster status is RED for 2 minutes. |
Raise condition | elasticsearch_cluster_health_status == 3 |
Description |
Raises when the Elasticsearch cluster status is RED for 2 minutes,
meaning that some or all of the primary shards are not ready. For the
exact reason, inspect the Elasticsearch logs on the log nodes in
/var/log/elasticsearch/elasticsearch.log. To verify the current status
of the cluster, run curl -XGET '<host>:<port>/_cat/health?pretty',
where host is elasticsearch:client:server:host and port is
elasticsearch:client:server:port defined in your model. |
Troubleshooting |
- Verify that the Elasticsearch service is running on all log nodes
using service elasticsearch status.
- Verify the status of the shards using
curl -XGET '<host>:<port>/_cat/shards'. For details, see
Cluster Allocation Explain API.
- Enable shard allocation by running the following command on the
log nodes:
curl -XPUT '<host>:<port>/_cluster/settings?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'
- Manually reallocate the unassigned shards as described in
Cluster Reroute.
- For more troubleshooting details, see the official Elasticsearch
documentation.
A sketch of querying the Cluster Allocation Explain API follows this
alert.
|
Tuning | Not required |
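To find out why a specific shard remains unassigned, you can query the
Cluster Allocation Explain API mentioned above. A minimal sketch using
the same <host> and <port> placeholders; the <index_name> and shard
number in the second request are placeholders for values taken from
the _cat/shards output:

# Without a request body, Elasticsearch explains the first unassigned
# shard it finds.
curl -XGET '<host>:<port>/_cluster/allocation/explain?pretty'

# Explain a specific shard taken from the _cat/shards output.
curl -XGET '<host>:<port>/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "<index_name>", "shard": 0, "primary": true}'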
ElasticsearchServiceDown
Severity | Minor |
Summary | The Elasticsearch service on the {{ $labels.host }} node is down. |
Raise condition | elasticsearch_up{host=~".*"} == 0 |
Description |
Raises when the Elasticsearch service is down on a log node. The
host label in the raised alert contains the host name of the
affected node. |
Troubleshooting |
- Verify the status of the service by running
systemctl status elasticsearch on the affected node.
- Inspect the Elasticsearch logs in
/var/log/elasticsearch/elasticsearch.log for the exact reason.
A combined sketch of both checks follows this alert.
|
Tuning | Not required |
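Both troubleshooting steps can be combined into a quick check on the
affected node. This is a minimal sketch; the journalctl call and the
grep patterns are illustrative additions, not part of the original
procedure:

# Check whether the service is active and see its most recent events.
systemctl status elasticsearch
journalctl -u elasticsearch --since "1 hour ago" --no-pager

# Search the Elasticsearch log for the exact reason of the failure.
tail -n 200 /var/log/elasticsearch/elasticsearch.log | grep -iE 'error|exception'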
ElasticsearchServiceDownMinor
Severity | Minor |
Summary | 30% of Elasticsearch services are down for 2 minutes. |
Raise condition |
count(elasticsearch_up{host=~".*"} == 0) >=
count(elasticsearch_up{host=~".*"}) * 0.3 |
Description |
Raises when the Elasticsearch service is down on 30% or more of the
log nodes. By default, 3 log nodes are present, meaning that the
alert raises when the service is down on at least one node. |
Troubleshooting |
- Inspect the
ElasticsearchServiceDown alerts for the host names of
the affected nodes.
- Verify the Elasticsearch status by running the
systemctl status elasticsearch command on the affected node.
- Inspect the Elasticsearch logs in
/var/log/elasticsearch/elasticsearch.log for the exact reason.
|
Tuning | Not required |
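As an alternative to inspecting the individual ElasticsearchServiceDown
alerts, you can run the inner expression of the raise condition in the
Prometheus web UI; it returns one series per node on which the service
is down:

elasticsearch_up{host=~".*"} == 0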
ElasticsearchServiceDownMajor
Severity | Major |
Summary | 60% of Elasticsearch services are down for 2 minutes. |
Raise condition |
count(elasticsearch_up{host=~".*"} == 0) >=
count(elasticsearch_up{host=~".*"}) * 0.6 |
Description |
Raises when the Elasticsearch service is down on 60% or more of the
log nodes. By default, 3 log nodes are present, meaning that the
alert raises when the service is down on at least two nodes. |
Troubleshooting |
- Inspect the
ElasticsearchServiceDown alerts for the host names of
the affected nodes.
- Verify the Elasticsearch status by running the
systemctl status elasticsearch command on the affected node.
- Inspect the Elasticsearch logs in
/var/log/elasticsearch/elasticsearch.log for the exact reason.
|
Tuning | Not required |
ElasticsearchServiceOutage
Severity | Critical |
Summary | All Elasticsearch services within the cluster are down. |
Raise condition |
count(elasticsearch_up{host=~".*"} == 0) ==
count(elasticsearch_up{host=~".*"}) |
Description | Raises when the Elasticsearch service is down on all log nodes. |
Troubleshooting |
- Inspect the ElasticsearchServiceDown alerts for the host names of
the affected nodes.
- Verify the Elasticsearch status by running the
systemctl status elasticsearch command on the affected nodes.
- Inspect the Elasticsearch logs in
/var/log/elasticsearch/elasticsearch.log for the exact reason.
A cluster-wide check from the Salt Master node is sketched after this
alert.
|
Tuning | Not required |
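During a full outage it is usually faster to run the checks on all log
nodes at once from the Salt Master node. A minimal sketch; the
I@elasticsearch:server target is an assumption, so adjust it to the way
the log nodes are classified in your model:

# Report the service state on every log node in one call.
salt -C 'I@elasticsearch:server' service.status elasticsearch

# Show the latest log lines from each node to find the exact reason.
salt -C 'I@elasticsearch:server' cmd.run 'tail -n 50 /var/log/elasticsearch/elasticsearch.log'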
ElasticsearchDiskWaterMarkMinor
Severity | Minor |
Summary |
The Elasticsearch {{ $labels.instance }} instance uses 60% of disk
space on the {{ $labels.host }} node for 5 minutes. |
Raise condition |
(max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max
by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max
by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.6 |
Description |
Raises when the Elasticsearch instance uses 60% or more of the disk
space on the log node for 5 minutes. To verify the available and used
disk space, run df -h. A sketch of checking disk usage through the
Elasticsearch API follows this alert. |
Troubleshooting |
- Free or extend the disk space on the Elasticsearch partition.
- Decrease the default retention period for Elasticsearch as described
in Configure Elasticsearch Curator.
|
Tuning |
Typically, you should not change the default value. If the alert is
constantly firing, verify the available disk space on the log nodes
and adjust the threshold according to the available space. Additionally,
in the Prometheus web UI, use the raise condition query to view the
graph for a longer period of time and define the best threshold.
For example, to change the threshold to 80%:

1. On the cluster level of the Reclass model, create a common file for
   all alert customizations. Skip this step if such a file is already
   defined.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold
   by overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           ElasticsearchDiskWaterMarkMinor:
             if: >-
               (max(elasticsearch_fs_total_total_in_bytes) by (host, instance)
               - max(elasticsearch_fs_total_available_in_bytes) by
               (host, instance)) / max(elasticsearch_fs_total_total_in_bytes)
               by (host, instance) >= 0.8

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
|
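In addition to df -h, Elasticsearch reports how much disk each node and
index consumes, which helps to decide between extending the disk and
lowering the Curator retention period. A minimal sketch using the same
<host> and <port> placeholders as the other alerts in this section:

# Per-node disk usage as seen by Elasticsearch (disk.used, disk.avail,
# and disk.percent columns).
curl -XGET '<host>:<port>/_cat/allocation?v'

# Per-index size, to see what a shorter retention period would free up.
curl -XGET '<host>:<port>/_cat/indices?v'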
ElasticsearchDiskWaterMarkMajor
Severity | Major |
Summary |
The Elasticsearch {{ $labels.instance }} instance uses 75% of disk
space on the {{ $labels.host }} node for 5 minutes. |
Raise condition |
(max by(host, instance) (elasticsearch_fs_total_total_in_bytes) - max
by(host, instance) (elasticsearch_fs_total_available_in_bytes)) / max
by(host, instance) (elasticsearch_fs_total_total_in_bytes) >= 0.75 |
Description |
Raises when the Elasticsearch instance uses 75% or more of the disk
space on the log node for 5 minutes. To verify the available and used
disk space, run df -h. |
Troubleshooting |
- Free or extend the disk space on the Elasticsearch partition.
- Decrease the default retention period for Elasticsearch as described
in Configure Elasticsearch Curator.
|
Tuning |
Typically, you should not change the default value. If the alert is
constantly firing, verify the available disk space on the log nodes
and adjust the threshold according to the available space. Additionally,
in the Prometheus web UI, use the raise condition query to view the
graph for a longer period of time and define the best threshold.
For example, to change the threshold to 90%:

1. On the cluster level of the Reclass model, create a common file for
   all alert customizations. Skip this step if such a file is already
   defined.

   Create a file for alert customizations:

   touch cluster/<cluster_name>/stacklight/custom/alerts.yml

   Define the new file in cluster/<cluster_name>/stacklight/server.yml:

   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...

2. In the defined alert customizations file, modify the alert threshold
   by overriding the if parameter:

   parameters:
     prometheus:
       server:
         alert:
           ElasticsearchDiskWaterMarkMajor:
             if: >-
               (max(elasticsearch_fs_total_total_in_bytes) by (host, instance)
               - max(elasticsearch_fs_total_available_in_bytes) by
               (host, instance)) / max(elasticsearch_fs_total_total_in_bytes)
               by (host, instance) >= 0.9

3. From the Salt Master node, apply the changes:

   salt 'I@prometheus:server' state.sls prometheus.server

4. Verify the updated alert definition in the Prometheus web UI.
|
ElasticsearchExporterNoDailyLogs
Available since 2019.2.10
Severity | Warning |
Summary | No new logs sent from a node within the last 3 hours. |
Raise condition |
(sum by (host) (changes(logs_program_host_doc_count[3h])) or sum by
(host) (up{host!=""})*0) == 0 |
Description |
Raises when no new logs were shipped from the {{ $labels.host }}
node within the last 3 hours. |
Troubleshooting |
Verify that Fluentd is operating properly on the affected node. A
sketch of this check follows this alert. |
Tuning | Not required |
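A quick way to check Fluentd on the affected node is sketched below. It
assumes that Fluentd runs as the td-agent service and logs to
/var/log/td-agent/td-agent.log, which are the td-agent defaults; verify
both against your deployment:

# Check that the Fluentd (td-agent) service is running.
systemctl status td-agent

# Look for recent errors, for example failed connections to the log nodes.
tail -n 100 /var/log/td-agent/td-agent.log | grep -iE 'error|warn'

You can also re-run the changes(logs_program_host_doc_count[3h]) part
of the raise condition in the Prometheus web UI, filtered by the
affected host, to confirm whether any documents arrived during the last
3 hours.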